research-article

Unsupervised Derivation of Keyword Summary for Short Texts

Published: 02 June 2021

Abstract

Automatically summarizing a group of short texts that mainly share one topic is a fundamental task in many applications, e.g., summarizing the main symptoms of a disease from a group of medical texts that are usually short, i.e., tens of words. Conventional unsupervised short text summarization techniques tend to select the most representative short text document. However, doing so may cause privacy issues, e.g., personal information in the medical texts may be exposed. Moreover, compared with a complete short text, which may contain unimportant words, a summary consisting of only a few keywords is preferable to users because of its clear and concise form. For these reasons, in this article we address the problem of unsupervised derivation of a keyword summary for short texts. Existing keyword extraction methods such as Latent Dirichlet Allocation cannot be applied to this problem, since (1) they ignore the ordering relations among the extracted keywords, which makes it difficult for people to capture the main idea of the event, and (2) short texts contain limited context, which makes it hard to find the optimal words for semantic coverage. Hence, we propose a simple yet effective method named Frequent Closed Wordsets Ranking (FCWRank) to derive the keyword summary from a short text cluster. FCWRank is an unsupervised method that builds on the idea of frequent closed itemset mining in transaction databases. FCWRank first mines all frequent closed wordsets from a cluster of short texts and then selects the most important wordset based on an importance model that simultaneously considers the similarity between closed wordsets and the relation between each closed wordset and the short text documents. To make the keywords within the wordset more understandable, FCWRank further unfolds the semantics behind them by sorting them.
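To make the three stages concrete (mine frequent closed wordsets, pick one, order its keywords), the following is a minimal Python sketch and not the FCWRank implementation. The abstract does not specify the importance model or the sorting criterion, so two stand-ins are assumed here: a hypothetical score of support times wordset size, and ordering by mean token position in the supporting texts. All function names and the toy cluster are illustrative, and the texts are assumed to be pre-tokenized with stop words removed.

```python
def frequent_closed_wordsets(docs, min_support):
    """Return {wordset: support} for all frequent closed wordsets.

    A wordset is closed when no proper superset has the same support.
    Closed wordsets are exactly the non-empty intersections of groups
    of documents, so we close the document sets under intersection.
    This brute-force approach is only suitable for small clusters.
    """
    doc_sets = [frozenset(d) for d in docs]
    closed = {s for s in doc_sets if s}
    frontier = set(closed)
    while frontier:
        new = set()
        for a in frontier:
            for t in doc_sets:
                c = a & t
                if c and c not in closed:
                    new.add(c)
        closed |= new
        frontier = new
    result = {}
    for ws in closed:
        support = sum(1 for t in doc_sets if ws <= t)
        if support >= min_support:
            result[ws] = support
    return result


def select_wordset(closed):
    """Pick one wordset with a stand-in score (support * size).

    ASSUMPTION: this is NOT the paper's importance model, which also
    weighs similarity between wordsets and their relation to the texts.
    Ties are broken alphabetically to keep the result deterministic.
    """
    return max(closed, key=lambda ws: (closed[ws] * len(ws), sorted(ws)))


def order_keywords(wordset, docs):
    """Order keywords by mean position in the texts containing the wordset.

    ASSUMPTION: the sorting criterion is hypothetical; the abstract only
    says that the keywords are sorted.
    """
    supporting = [d for d in docs if wordset <= set(d)]

    def mean_position(word):
        return sum(d.index(word) for d in supporting) / len(supporting)

    return sorted(wordset, key=lambda w: (mean_position(w), w))


# Toy cluster of pre-tokenized short texts (illustrative data only).
docs = [
    ["patient", "fever", "dry", "cough"],
    ["fever", "dry", "cough", "monday"],
    ["mild", "fever", "dry", "cough"],
    ["headache", "fever"],
]
closed = frequent_closed_wordsets(docs, min_support=3)
summary = order_keywords(select_wordset(closed), docs)
print(summary)  # ['fever', 'dry', 'cough']
```

Note that {fever} has higher support (4) than {fever, dry, cough} (3), yet both are closed because each has a different support from its supersets; the stand-in score then prefers the larger, more informative wordset.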
Experiments on real-world short text collections show that FCWRank outperforms the state-of-the-art baselines in terms of Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L) F1, precision, and recall scores.
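As a rough illustration of the evaluation metric named above, the sketch below computes ROUGE-L precision, recall, and F1 from the longest common subsequence (LCS) of a candidate keyword sequence and a reference sequence. It is a simplified, single-reference version written for this example, not the official ROUGE package, and the keyword sequences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate) if candidate else 0.0
    recall = lcs / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sequences only; not data from the article.
reference = ["fever", "persistent", "dry", "cough"]
p, r, f1 = rouge_l(["fever", "dry", "cough"], reference)
print(p, r, round(f1, 3))  # 1.0 0.75 0.857
```

Because the LCS respects word order, the same keywords in reverse order (`["cough", "dry", "fever"]`) share an LCS of length 1 with the reference and score much lower, which is why keyword ordering matters under this metric.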



    Published In

    ACM Transactions on Internet Technology, Volume 21, Issue 2
    June 2021
    599 pages
    ISSN:1533-5399
    EISSN:1557-6051
    DOI:10.1145/3453144
    Editor: Ling Liu
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 June 2021
    Online AM: 07 May 2020
    Accepted: 01 April 2020
    Revised: 01 April 2020
    Received: 01 February 2020
    Published in TOIT Volume 21, Issue 2


    Author Tags

    1. Keyword summary
    2. short texts
    3. unsupervised

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key R&D Program of China
    • Zhejiang Lab
    • Fundamental Research Funds for the Provincial Universities of Zhejiang
