short-paper

A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts

Published: 16 August 2017

Abstract

Authorship attribution is a long-standing problem in Natural Language Processing, and several statistical and computational methods have been applied to it. In this article, we propose methods for authorship attribution in Bengali. More specifically, we propose a supervised framework consisting of lexical and shallow features, and investigate the possibility of using topic-modeling-inspired features, to classify documents according to their authors. We created a corpus from nearly all the literary works of three eminent Bengali authors, consisting of 3,000 disjoint samples. Our models outperformed the state of the art, with more than 98% test accuracy for the shallow features and 100% test accuracy for the topic-based features. Further experiments with GloVe vectors [Pennington et al. 2014] showed comparable results, but flexible patterns based on content words and high-frequency words [Schwartz et al. 2013] failed to perform as well as expected.
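
As a rough illustration of what supervised attribution from shallow features can look like, the sketch below trains a multinomial Naive Bayes classifier on character trigram counts. It is not the authors' implementation: the feature set, the smoothing, the class name NaiveBayesAttributor, and the toy English sentences (standing in for the Bengali corpus) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, a simple shallow feature."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesAttributor:
    """Hypothetical helper: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, authors):
        self.counts = defaultdict(Counter)  # author -> n-gram counts
        self.docs = Counter(authors)        # author -> number of training samples
        for text, author in zip(texts, authors):
            self.counts[author].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for author, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(self.docs[author] / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            scores[author] = score
        return max(scores, key=scores.get)


# Toy data (assumption): two "authors" with two samples each.
train_texts = [
    "the river runs quietly through the village",
    "the river runs past the quiet village fields",
    "markets crash loudly when traders panic",
    "traders panic and markets crash again",
]
train_authors = ["A", "A", "B", "B"]

model = NaiveBayesAttributor().fit(train_texts, train_authors)
print(model.predict("the quiet river"))
print(model.predict("markets panic"))
```

In a realistic setting the samples would be disjoint excerpts from each author's works, and accuracy would be measured on held-out test samples rather than on hand-picked phrases.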

References

[1]
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.
[2]
David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Adv. Neural Info. Process. Syst. 16 (2004), 17.
[3]
Victoria Bobicev, Marina Sokolova, Khaled El Emam, and Stan Matwin. 2013. Authorship attribution in health forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP’13). INCOMA Ltd. Shoumen, Bulgaria, 74--82.
[4]
Dasha Bogdanova and Angeliki Lazaridou. 2014. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).
[5]
Tanmoy Chakraborty. 2012. Authorship identification using stylometry analysis in Bengali literature. CoRR abs/1208.6268 (2012). http://arxiv.org/abs/1208.6268
[6]
Suprabhat Das and Pabitra Mitra. 2011. Author identification in Bengali literary works. In Pattern Recognition and Machine Intelligence, Sergei O. Kuznetsov, Deba P. Mandal, Malay K. Kundu, and Sankar K. Pal (Eds.). Lecture Notes in Computer Science, Vol. 6744. Springer, Berlin, 220--226.
[7]
Farkhund Iqbal, Rachid Hadjidj, Benjamin C. M. Fung, and Mourad Debbabi. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Dig. Invest. 5, Supplement (2008), S42--S51.
[8]
Siladitya Jana. 2015. Sister Nivedita’s influence on J. C. Bose’s writings. J. Assoc. Info. Sci. Technol. 66, 3 (2015), 645--650.
[9]
Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr. 1, 3 (Dec. 2006), 233--334.
[10]
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60, 1 (Jan. 2009), 9--26.
[11]
Shibamouli Lahiri and Rada Mihalcea. 2013. Authorship attribution using word network features. CoRR abs/1311.2978 (2013). http://arxiv.org/abs/1311.2978
[12]
R. Layton, P. Watters, and R. Dazeley. 2010a. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
[13]
Robert Layton, Paul Watters, and Richard Dazeley. 2010b. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
[14]
Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577--584.
[15]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111--3119.
[16]
Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. J. Amer. Statist. Assoc. 58, 302 (1963), 275--309.
[17]
Sibansu Mukhopadhyay, Tirthankar Dasgupta, and Anupam Basu. 2012. Development of an online repository of Bangla literary texts and its ontological representation for advance search options. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation Workshop Programme. Citeseer, 93.
[18]
S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, and B. VishnuVardhan. 2015. Empirical evaluations using character and word n-grams on authorship attribution for Telugu text. In Intelligent Computing and Applications, Durbadal Mandal, Rajib Kar, Swagatam Das, and Bijaya Ketan Panigrahi (Eds.). Advances in Intelligent Systems and Computing, Vol. 343. Springer, India, 613--623.
[19]
A. Jamal Nasir, Nico Görnitz, and Ulf Brefeld. 2014. An off-the-shelf approach to authorship attribution. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING’14). Dublin City University and Association for Computational Linguistics, 895--904.
[20]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825--2830.
[21]
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, Doha, Qatar, 1532--1543. Retrieved from http://www.aclweb.org/anthology/D14-1162
[22]
Shanta Phani, Shibamouli Lahiri, and Arindam Biswas. 2015. Authorship attribution in Bengali language. In Proceedings of the 12th International Conference on Natural Language Processing (ICON’15).
[23]
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487--494.
[24]
Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06). Association for Computational Linguistics, Stroudsburg, PA, 482--491.
[25]
Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 78--86.
[26]
Jacques Savoy. 2013. Authorship attribution based on a probabilistic topic model. Info. Process. Manage. 49, 1 (2013), 341--354.
[27]
Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1880--1891. http://aclweb.org/anthology/D13-1193
[28]
Santiago Segarra, Mark Eisen, and Alejandro Ribeiro. 2014. Authorship attribution through function word adjacency networks. CoRR abs/1406.4469 (2014). http://arxiv.org/abs/1406.4469
[29]
Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, 264--269.
[30]
Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2014. Authorship attribution with topic models. Comput. Linguist. 40, 2 (June 2014), 269--310.
[31]
Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538--556.
[32]
Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 306--315.
[33]
Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Association for Computational Linguistics, 59--63.
[34]
Ying Zhao, Justin Zobel, and Phil Vines. 2006. Using Relative Entropy for Authorship Attribution. Springer, Berlin, 92--105.

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 16, Issue 4
      December 2017
      146 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3097269

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 August 2017
      Accepted: 01 May 2017
      Revised: 01 May 2017
      Received: 01 September 2016
      Published in TALLIP Volume 16, Issue 4

      Author Tags

      1. Authorship attribution
      2. Naive Bayes
      3. lexical features
      4. machine learning
      5. topic model

      Qualifiers

      • Short-paper
      • Research
      • Refereed

      Cited By

      • (2024) Automatic authorship attribution in Albanian texts. PLOS ONE 19:10 (e0310057). DOI: 10.1371/journal.pone.0310057. Online publication date: 22-Oct-2024
      • (2023) Albanian Authorship Attribution Model. 2023 12th Mediterranean Conference on Embedded Computing (MECO) (1-5). DOI: 10.1109/MECO58584.2023.10155046. Online publication date: 6-Jun-2023
      • (2022) A Survey on Authorship Analysis Tasks and Techniques. SEEU Review 17:2 (153-167). DOI: 10.2478/seeur-2022-0100. Online publication date: 30-Dec-2022
      • (2022) Bengali text document categorization based on very deep convolution neural network. Expert Systems with Applications: An International Journal 184:C. DOI: 10.1016/j.eswa.2021.115394. Online publication date: 22-Apr-2022
      • (2021) A New Concept of Electronic Text Based on Semantic Coding System for Machine Translation. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-16). DOI: 10.1145/3469655. Online publication date: 2-Nov-2021
      • (2021) DAAB: Deep Authorship Attribution in Bengali. 2021 International Joint Conference on Neural Networks (IJCNN) (1-9). DOI: 10.1109/IJCNN52387.2021.9533619. Online publication date: 2021
      • (2021) Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks. IEEE Access 9 (100319-100338). DOI: 10.1109/ACCESS.2021.3095967. Online publication date: 2021
      • (2021) Deep neural network and model-based clustering technique for forensic electronic mail author attribution. SN Applied Sciences 3:3. DOI: 10.1007/s42452-020-04127-6. Online publication date: 18-Feb-2021
      • (2021) Text Classification Using Convolution Neural Networks with FastText Embedding. Hybrid Intelligent Systems (103-113). DOI: 10.1007/978-3-030-73050-5_11. Online publication date: 17-Apr-2021
      • (2020) Detecting Suspicious Texts Using Machine Learning Techniques. Applied Sciences 10:18 (6527). DOI: 10.3390/app10186527. Online publication date: 18-Sep-2020
