DOI: 10.1145/3098954.3104050

On the Usefulness of Compression Models for Authorship Verification

Published: 29 August 2017

Abstract

Compression models represent an interesting approach for different classification tasks and have been used widely across many research fields. We adapt compression models to the field of authorship verification (AV), a branch of digital text forensics. The task in AV is to verify whether a questioned document and a reference document of a known author were written by the same person. We propose an intrinsic AV method that yields competitive results compared to a number of current state-of-the-art approaches based on support vector machines or neural networks. In contrast to these approaches, however, our method does not make use of machine learning algorithms, natural language processing techniques, feature engineering, hyperparameter optimization, or external documents (a common strategy for transforming AV from a one-class into a multi-class classification problem). Instead, the method has only three key components: a compression algorithm, a dissimilarity measure, and a threshold needed to accept or reject the authorship of the questioned document. Owing to its compactness, our method runs very fast and can be reimplemented with minimal effort. In addition, the method can handle complicated AV cases in which the questioned and the reference documents are unrelated in topic or genre. We evaluated our approach on publicly available datasets that were used in three international AV competitions. Furthermore, we constructed our own corpora, on which we compared our method against state-of-the-art approaches; in both cases we achieved promising results.
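The three components named in the abstract (a compression algorithm, a dissimilarity measure, and a threshold) can be illustrated with a minimal Python sketch. The abstract does not say which compressor or measure the authors use, so every choice here is an assumption made for illustration: `zlib` stands in for the compressor, the normalized compression distance (NCD) stands in for the dissimilarity measure, and the function names and the threshold value 0.5 are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, used here as an assumed
    stand-in for the dissimilarity measure: values near 0 mean the
    two inputs share much structure, values near 1 mean little."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def verify(questioned: str, reference: str, threshold: float) -> bool:
    """Accept the authorship if the dissimilarity does not exceed the
    threshold. The threshold is a hypothetical parameter; a real value
    would have to be determined on suitable data."""
    score = ncd(questioned.encode("utf-8"), reference.encode("utf-8"))
    return score <= threshold


if __name__ == "__main__":
    doc_a = "Compression models represent an interesting approach. " * 5
    doc_b = "0123456789 zyxwvutsrqponmlkjihgfedcba QWERTY!@#$%^&*() " * 5
    print(verify(doc_a, doc_a, threshold=0.5))  # identical texts
    print(verify(doc_a, doc_b, threshold=0.5))  # unrelated texts
```

The sketch only shows how the three parts fit together. It makes no claim to reproduce the reported results, and short toy strings like these say nothing about behavior on real documents.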

Published In

ARES '17: Proceedings of the 12th International Conference on Availability, Reliability and Security
August 2017
853 pages
ISBN:9781450352574
DOI:10.1145/3098954
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Authorship verification
  2. compression models
  3. intrinsic authorship verification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ARES '17
ARES '17: International Conference on Availability, Reliability and Security
August 29 - September 1, 2017
Reggio Calabria, Italy

Acceptance Rates

ARES '17 Paper Acceptance Rate 100 of 191 submissions, 52%;
Overall Acceptance Rate 228 of 451 submissions, 51%

Cited By

  • (2024) Overview of PAN 2024: Multi-author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification Condensed Lab Overview. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 231-259. DOI: 10.1007/978-3-031-71908-0_11. Online publication date: 9-Sep-2024
  • (2022) Global interest meets new perspectives. International Journal of Speech, Language and the Law 28:2. DOI: 10.1558/ijsll.21387. Online publication date: 8-Jul-2022
  • (2022) Authorship Attribution with Temporal Data in Reddit. Proceedings of the XVIII Brazilian Symposium on Information Systems, 1-8. DOI: 10.1145/3535511.3535515. Online publication date: 16-May-2022
  • (2022) Advancing the Use of Information Compression Distances in Authorship Attribution. Disinformation in Open Online Media, 114-122. DOI: 10.1007/978-3-031-18253-2_8. Online publication date: 4-Oct-2022
  • (2021) POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. Proceedings of the 16th International Conference on Availability, Reliability and Security, 1-12. DOI: 10.1145/3465481.3470050. Online publication date: 17-Aug-2021
  • (2020) Authorship Verification of Yorùbá Blog Posts using Character N-grams. 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), 1-6. DOI: 10.1109/ICMCECS47690.2020.246982. Online publication date: Mar-2020
  • (2020) Authorship verification of opinion articles in online newspapers using the idiolect of author: a comparative study. Information, Communication & Society, 1-19. DOI: 10.1080/1369118X.2020.1716039. Online publication date: 4-Feb-2020
  • (2019) Assessing the Applicability of Authorship Verification Methods. Proceedings of the 14th International Conference on Availability, Reliability and Security, 1-10. DOI: 10.1145/3339252.3340508. Online publication date: 26-Aug-2019
  • (2019) Similarity Learning for Authorship Verification in Social Media. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2457-2461. DOI: 10.1109/ICASSP.2019.8683405. Online publication date: May-2019
  • (2019) Improving author verification based on topic modeling. Journal of the Association for Information Science and Technology 70:10, 1074-1088. DOI: 10.1002/asi.24183. Online publication date: 2-Sep-2019
