
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT

Online AM: 22 April 2022

Abstract

Authorship attribution is the task of building a characterization of text that captures an author's writing style in order to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly important in security and plagiarism detection. Despite significant advances in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field because of its complex linguistic features and sentence structure. Moreover, existing systems do not scale as the number of authors increases, and their performance drops when only a small number of samples per author is available. In this paper, we propose the use of the Averaged Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problems of complex linguistic feature extraction and scalability for authorship attribution in Bangla literature (AABL). We analyze the effect of different tokenization schemes, namely word, sub-word, and character level tokenization, and demonstrate the effectiveness of each in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16), containing 17,966 sample texts and more than 13.4 million words, to address the scarcity of standard datasets, and we release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used BAAD16 as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better as the number of authors increases, and its performance remains steady even with few training samples.
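
To make the three tokenization levels named in the abstract concrete, the following is a minimal illustrative sketch, not the tokenizers used in the paper. The function names, the `▁` word-boundary marker, and the tiny hand-written vocabulary are all assumptions made for illustration. In particular, the greedy longest-match routine only stands in for a learned sub-word model, and the character-level function splits on Unicode code points, so a Bangla conjunct or vowel sign becomes its own token.

```python
from typing import List, Set

BOUNDARY = "\u2581"  # "▁", an assumed marker for the start of a word


def word_tokens(text: str) -> List[str]:
    """Word level: split on whitespace; punctuation stays attached."""
    return text.split()


def char_tokens(text: str) -> List[str]:
    """Character level: one token per code point, spaces made explicit."""
    return [BOUNDARY if ch == " " else ch for ch in text]


def subword_tokens(text: str, vocab: Set[str]) -> List[str]:
    """Sub-word level: greedy longest-match against a fixed vocabulary.

    Any character not covered by the vocabulary is emitted on its own,
    so the segmentation never fails.
    """
    tokens: List[str] = []
    for word in text.split():
        piece = BOUNDARY + word
        i = 0
        while i < len(piece):
            # Try the longest candidate first, shrinking to one character.
            for j in range(len(piece), i, -1):
                if piece[i:j] in vocab or j == i + 1:
                    tokens.append(piece[i:j])
                    i = j
                    break
    return tokens


if __name__ == "__main__":
    sentence = "আমি বাংলায় গান গাই"
    toy_vocab = {BOUNDARY + "আমি", BOUNDARY + "গান"}  # hypothetical vocabulary
    print(word_tokens(sentence))
    print(char_tokens(sentence))
    print(subword_tokens(sentence, toy_vocab))
```

The trade-off the sketch exposes is vocabulary size against sequence length: word tokens give short sequences but an open vocabulary, character tokens give a small closed vocabulary but long sequences, and sub-word tokens sit between the two.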

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Just Accepted
EISSN:2375-4702
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 22 April 2022
Accepted: 04 April 2022
Revised: 03 April 2022
Received: 31 October 2020

Author Tags

  1. Authorship attribution
  2. Transfer learning
  3. Language model
  4. AWD-LSTM
  5. Bangla

Qualifiers

  • Research-article
  • Refereed
