research-article

A Comparative Analysis on Hindi and English Extractive Text Summarization

Published: 09 May 2019

Abstract

Text summarization is the process of condensing a large body of document information into a clear and concise form. In this article, we present a detailed comparative study of various extractive methods for automatic text summarization on Hindi and English text datasets of news articles. We consider 13 different summarization techniques, namely, TextRank, LexRank, Luhn, LSA, Edmundson, ChunkRank, TGraph, UniRank, NN-ED, NN-SE, FE-SE, SummaRuNNer, and MMR-SE, and we evaluate their performance using various metrics, such as precision, recall, F1, cohesion, non-redundancy, readability, and significance. A thorough analysis, carried out in eight parts, shows the strengths and limitations of these methods, the effect of summary length on performance, the impact of a document's language, and other factors. A standard summary evaluation tool (ROUGE) and extensive programmatic evaluation using Python 3.5 in an Anaconda environment are used to evaluate their output.
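To make the precision, recall, and F1 metrics named above concrete, the following is a minimal sketch of a ROUGE-1-style unigram overlap score between a candidate summary and a reference summary. It is an illustration only, not the authors' evaluation code or the official ROUGE package: the function name `rouge1_scores` is hypothetical, and whitespace tokenization with lowercasing is an assumed simplification (no stemming, stop-word removal, or Hindi-specific punctuation handling).

```python
from collections import Counter


def rouge1_scores(candidate, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is a simplification of real ROUGE preprocessing.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Multiset intersection: each unigram is credited at most as many
    # times as it occurs in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = rouge1_scores("the cat sat on the mat", "the cat lay on the mat")
    print(round(p, 3), round(r, 3), round(f, 3))
```

Precision here is the share of candidate unigrams that also occur in the reference, recall is the share of reference unigrams recovered by the candidate, and F1 is their harmonic mean. Because the tokenizer only splits on whitespace, the same function can be applied to Devanagari text, although punctuation such as the danda would stay attached to neighbouring words under this assumption.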


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 18, Issue 3
September 2019
386 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3305347
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 May 2019
Accepted: 01 January 2019
Revised: 01 October 2018
Received: 01 September 2017
Published in TALLIP Volume 18, Issue 3


Author Tags

  1. ROUGE
  2. Text summarization
  3. graph-based techniques
  4. latent semantic analysis
  5. meta-heuristic-based techniques
  6. neural networks-based techniques

Qualifiers

  • Research-article
  • Research
  • Refereed
