short-paper

A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts

Authors:

Shibamouli Lahiri,

Arindam Biswas

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 16, Issue 4

Article No.: 28, Pages 1 - 15

https://doi.org/10.1145/3099473

Published: 16 August 2017

Abstract

Authorship attribution is a long-standing problem in Natural Language Processing, and several statistical and computational methods have been applied to it. In this article, we propose methods for authorship attribution in Bengali. More specifically, we propose a supervised framework built on lexical and shallow features, and we investigate topic-modeling-inspired features, to classify documents according to their authors. We created a corpus of 3,000 disjoint samples drawn from nearly all the literary works of three eminent Bengali authors. Our models outperformed the state of the art, with more than 98% test accuracy for the shallow features and 100% test accuracy for the topic-based features. Further experiments with GloVe vectors [Pennington et al. 2014] showed comparable results, but flexible patterns based on content words and high-frequency words [Schwartz et al. 2013] did not perform as well as expected.
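
To make the idea of a supervised framework over shallow features concrete, the following is a minimal sketch and not the authors' implementation. It assumes character bigram frequencies as the shallow feature and a nearest-centroid rule with cosine similarity as the classifier; the abstract specifies neither. The function names (shallow_features, train, predict) and the toy samples are hypothetical.

```python
from collections import Counter
import math


def shallow_features(text, n=2):
    """Relative frequencies of character n-grams (an assumed shallow feature)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return {g: c / total for g, c in Counter(grams).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(samples):
    """samples: list of (text, author). Returns one mean feature vector per author."""
    sums, counts = {}, Counter()
    for text, author in samples:
        counts[author] += 1
        acc = sums.setdefault(author, Counter())
        for gram, value in shallow_features(text).items():
            acc[gram] += value
    return {a: {g: v / counts[a] for g, v in acc.items()} for a, acc in sums.items()}


def predict(centroids, text):
    """Assign the author whose centroid is most similar to the text's features."""
    feats = shallow_features(text)
    return max(centroids, key=lambda a: cosine(feats, centroids[a]))


# Toy, made-up samples standing in for labeled text from two authors.
centroids = train([("aaaa abab aaab", "A"), ("zzzz zyzy zzzy", "B")])
print(predict(centroids, "abaa aaba"))
```

Character n-grams operate on Unicode code points, so the same sketch runs unchanged on Bengali text; a real experiment would need held-out test samples and a stronger classifier.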

References

[1]

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.

[2]

David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Adv. Neural Info. Process. Syst. 16 (2004), 17.

[3]

Victoria Bobicev, Marina Sokolova, Khaled El Emam, and Stan Matwin. 2013. Authorship attribution in health forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP’13). INCOMA Ltd. Shoumen, Bulgaria, 74--82.

[4]

Dasha Bogdanova and Angeliki Lazaridou. 2014. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).

[5]

Tanmoy Chakraborty. 2012. Authorship identification using stylometry analysis in Bengali literature. CoRR abs/1208.6268 (2012). http://arxiv.org/abs/1208.6268

[6]

Suprabhat Das and Pabitra Mitra. 2011. Author identification in Bengali literary works. In Pattern Recognition and Machine Intelligence, Sergei O. Kuznetsov, Deba P. Mandal, Malay K. Kundu, and Sankar K. Pal (Eds.). Lecture Notes in Computer Science, Vol. 6744. Springer, Berlin, 220--226.


[7]

Farkhund Iqbal, Rachid Hadjidj, Benjamin C. M. Fung, and Mourad Debbabi. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Dig. Invest. 5, Supplement (2008), S42--S51.


[8]

Siladitya Jana. 2015. Sister Nivedita’s influence on J. C. Bose’s writings. J. Assoc. Info. Sci. Technol. 66, 3 (2015), 645--650.


[9]

Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr. 1, 3 (Dec. 2006), 233--334.


[10]

Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60, 1 (Jan. 2009), 9--26.


[11]

Shibamouli Lahiri and Rada Mihalcea. 2013. Authorship attribution using word network features. CoRR abs/1311.2978 (2013). http://arxiv.org/abs/1311.2978

[12]

R. Layton, P. Watters, and R. Dazeley. 2010a. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.


[13]

Robert Layton, Paul Watters, and Richard Dazeley. 2010b. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.


[14]

Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577--584.


[15]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111--3119.


[16]

Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. J. Amer. Statist. Assoc. 58, 302 (1963), 275--309.

[17]

Sibansu Mukhopadhyay, Tirthankar Dasgupta, and Anupam Basu. 2012. Development of an online repository of Bangla literary texts and its ontological representation for advance search options. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation Workshop Programme. Citeseer, 93.

[18]

S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, and B. VishnuVardhan. 2015. Empirical evaluations using character and word n-grams on authorship attribution for Telugu text. In Intelligent Computing and Applications, Durbadal Mandal, Rajib Kar, Swagatam Das, and Bijaya Ketan Panigrahi (Eds.). Advances in Intelligent Systems and Computing, Vol. 343. Springer, India, 613--623.

[19]

A. Jamal Nasir, Nico Görnitz, and Ulf Brefeld. 2014. An off-the-shelf approach to authorship attribution. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING’14). Dublin City University and Association for Computational Linguistics, 895--904.

[20]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825--2830.


[21]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, Doha, Qatar, 1532--1543. Retrieved from http://www.aclweb.org/anthology/D14-1162

[22]

Shanta Phani, Shibamouli Lahiri, and Arindam Biswas. 2015. Authorship attribution in Bengali language. In Proceedings of the 12th International Conference on Natural Language Processing (ICON’15).

[23]

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487--494.


[24]

Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06). Association for Computational Linguistics, Stroudsburg, PA, 482--491.


[25]

Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 78--86.


[26]

Jacques Savoy. 2013. Authorship attribution based on a probabilistic topic model. Info. Process. Manage. 49, 1 (2013), 341--354.


[27]

Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1880--1891. http://aclweb.org/anthology/D13-1193

[28]

Santiago Segarra, Mark Eisen, and Alejandro Ribeiro. 2014. Authorship attribution through function word adjacency networks. CoRR abs/1406.4469 (2014). http://arxiv.org/abs/1406.4469

[29]

Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, 264--269.


[30]

Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2014. Authorship attribution with topic models. Comput. Linguist. 40, 2 (June 2014), 269--310.


[31]

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538--556.

[32]

Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 306--315.


[33]

Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Association for Computational Linguistics, 59--63.

[34]

Ying Zhao, Justin Zobel, and Phil Vines. 2006. Using Relative Entropy for Authorship Attribution. Springer, Berlin, 92--105.


Cited By

Misini A, Canhasi E, Kadriu A, Fetahi E (2024). Automatic authorship attribution in Albanian texts. PLOS ONE 19:10 (e0310057). Online publication date: 22-Oct-2024.
https://doi.org/10.1371/journal.pone.0310057
Misini A, Kadriu A, Canhasi E (2023). Albanian Authorship Attribution Model. 2023 12th Mediterranean Conference on Embedded Computing (MECO) (1-5). Online publication date: 6-Jun-2023.
https://doi.org/10.1109/MECO58584.2023.10155046
Misini A, Kadriu A, Canhasi E (2022). A Survey on Authorship Analysis Tasks and Techniques. SEEU Review 17:2 (153-167). Online publication date: 30-Dec-2022.
https://doi.org/10.2478/seeur-2022-0100

Index Terms

A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
      2. Language resources



Published In


ACM Transactions on Asian and Low-Resource Language Information Processing Volume 16, Issue 4

December 2017

146 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3097269

Editor:
Nianwen Xue
Brandeis University, Waltham, USA


Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2017

Accepted: 01 May 2017

Revised: 01 May 2017

Received: 01 September 2016

Published in TALLIP Volume 16, Issue 4


Qualifiers

Short-paper
Research
Refereed

Bibliometrics

Total Citations: 12
Total Downloads: 260
Downloads (Last 12 months): 21
Downloads (Last 6 weeks): 1

Reflects downloads up to 03 Feb 2025

Citations

Cited By

Hossain M, Hoque M, Siddique N, Sarker I (2022). Bengali text document categorization based on very deep convolution neural network. Expert Systems with Applications: An International Journal 184:C. Online publication date: 22-Apr-2022.
https://dl.acm.org/doi/10.1016/j.eswa.2021.115394
Eddine M (2021). A New Concept of Electronic Text Based on Semantic Coding System for Machine Translation. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-16). Online publication date: 2-Nov-2021.
https://dl.acm.org/doi/10.1145/3469655
Dipongkor A, Islam M, Kayesh H, Hossain M, Anwar A, Rahman K, Razzak I (2021). DAAB: Deep Authorship Attribution in Bengali. 2021 International Joint Conference on Neural Networks (IJCNN) (1-9). Online publication date: 2021.
https://doi.org/10.1109/IJCNN52387.2021.9533619
Hossain M, Hoque M, Dewan M, Siddique N, Islam M, Sarker I (2021). Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks. IEEE Access 9 (100319-100338). Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3095967
Apoorva K, Sangeetha S (2021). Deep neural network and model-based clustering technique for forensic electronic mail author attribution. SN Applied Sciences 3:3. Online publication date: 18-Feb-2021.
https://doi.org/10.1007/s42452-020-04127-6
Hossain M, Hoque M, Sarker I (2021). Text Classification Using Convolution Neural Networks with FastText Embedding. Hybrid Intelligent Systems (103-113). Online publication date: 17-Apr-2021.
https://doi.org/10.1007/978-3-030-73050-5_11
Sharif O, Hoque M, Kayes A, Nowrozy R, Sarker I (2020). Detecting Suspicious Texts Using Machine Learning Techniques. Applied Sciences 10:18 (6527). Online publication date: 18-Sep-2020.
https://doi.org/10.3390/app10186527
