research-article

An ensemble clustering approach for topic discovery using implicit text segmentation

Published: 01 August 2021

Abstract

Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Text clustering is likewise a well-known challenge for multi-topic text collections, because such documents have a sub-topic structure and their parts are not necessarily related to one another. Existing clustering approaches follow TS methods that rely on word frequencies and may not be suitable for clustering multi-topic text collections. In this work, we propose an ensemble clustering approach (ECA), a novel topic-modelling-based clustering approach that combines TS and text clustering. We develop LDA-onto (LDA-ontology), a TS-based model that decomposes a document into segments (i.e. sub-documents), wherein each sub-document is associated with exactly one sub-topic. We address the problem of clustering documents that are intrinsically related to several topics but whose topical structure is not given. ECA is tested on well-known datasets to provide a comprehensive presentation and validation of clustering algorithms that use LDA-onto. ECA captures the semantic relations among keywords in sub-documents, and the resulting clusters are related back to the original documents that contain those sub-documents. Moreover, the present research sheds light on clustering performance and indicates that performance (in terms of F-measure) does not differ when the number of topics changes. Our findings show above-par results and allow the problem of text clustering to be analysed more broadly without applying dimension reduction techniques to highly sparse data. Specifically, ECA provides a more efficient framework than the traditional and segment-based approaches; the achieved results are statistically significant, with an average improvement of over 10.2%.
For the most part, the proposed framework can be evaluated in applications where meaningful data retrieval is useful, such as document summarization, text retrieval, and novelty and topic detection.
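
The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions: relating segment-level clusters back to their parent documents, and scoring the result with an F-measure. It assumes that segmentation and clustering have already produced one cluster label per sub-document. The function names, the toy data, and the choice of the class-weighted best-match F-measure are all illustrative assumptions, not the authors' LDA-onto or ECA implementation.

```python
from collections import defaultdict


def document_clusters(segment_labels):
    """Map segment-level cluster labels back to parent documents.

    segment_labels: dict of (doc_id, segment_index) -> cluster id.
    A document joins every cluster that holds at least one of its
    segments, so multi-topic documents may appear in several clusters.
    """
    clusters = defaultdict(set)
    for (doc_id, _segment_index), cluster_id in segment_labels.items():
        clusters[cluster_id].add(doc_id)
    return {cluster_id: sorted(docs) for cluster_id, docs in clusters.items()}


def clustering_f_measure(classes, clusters):
    """Class-weighted best-match F-measure (an assumed variant).

    classes:  dict of class name -> set of doc ids (reference labels).
    clusters: dict of cluster id -> iterable of doc ids.
    Each class is matched to the cluster that maximises its F score,
    and the scores are averaged with weights proportional to class size.
    """
    total = sum(len(members) for members in classes.values())
    if total == 0:
        return 0.0
    score = 0.0
    for members in classes.values():
        best = 0.0
        for docs in clusters.values():
            docs = set(docs)
            overlap = len(members & docs)
            if overlap == 0:
                continue
            precision = overlap / len(docs)
            recall = overlap / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += len(members) / total * best
    return score


# Hypothetical toy input: d1 has two segments on different sub-topics.
segment_labels = {
    ("d1", 0): 0, ("d1", 1): 1,
    ("d2", 0): 0,
    ("d3", 0): 1, ("d3", 1): 1,
}
# Hypothetical reference classes for the same three documents.
classes = {"sports": {"d1", "d2"}, "politics": {"d3"}}

clusters = document_clusters(segment_labels)
score = clustering_f_measure(classes, clusters)
print(clusters, round(score, 3))
```

In this toy case the two-topic document d1 lands in both clusters, which is the overlapping behaviour a segment-based approach is meant to allow.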

Published In

Journal of Information Science, Volume 47, Issue 4
Aug 2021
102 pages

Publisher

Sage Publications, Inc.
United States


Author Tags

1. Information retrieval
2. natural language processing
3. ontological similarity
4. text clustering
5. text mining
6. text segmentation
