research-article

Cluster-based information retrieval using pattern mining

Published: 01 April 2021

    Abstract

    This paper addresses the problem of answering user queries by fetching the most relevant object from a clustered set of objects. It targets the common drawbacks of cluster-based approaches and aims at fast, high-quality information retrieval. To this end, a novel cluster-based information retrieval approach is proposed, named Cluster-based Retrieval using Pattern Mining (CRPM), which integrates several clustering and pattern mining algorithms. First, it groups similar objects into clusters. Three clustering algorithms, based on k-means, DBSCAN (density-based spatial clustering of applications with noise), and spectral clustering, are proposed to minimize the number of terms shared among the clusters. Second, frequent and high-utility pattern mining algorithms are run on each cluster to extract its pattern base. Third, the clusters are ranked for every query. Two ranking strategies are proposed: i) Score Pattern Computing (SPC), which calculates a score representing the similarity between a user query and a cluster; and ii) Weighted Terms in Clusters (WTC), which calculates a weight for every term and uses the relevant terms to compute the score between a user query and each cluster. Irrelevant information derived from the pattern bases is also used to handle unexpected user queries. To evaluate the proposed approach, extensive experiments were carried out on two use cases: a document corpus and a tweet corpus. The results showed that the proposed approach outperformed traditional and cluster-based information retrieval approaches in the quality of the returned objects while remaining very competitive in runtime.
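
The three steps described in the abstract (cluster, mine a pattern base per cluster, rank clusters against a query) can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the SPC formula, so `spc_like_score` below is a hypothetical stand-in that sums the supports of mined patterns contained in the query. The clustering step is skipped by assuming the toy clusters already exist, and `mine_frequent_patterns` is a brute-force enumerator rather than one of the frequent or high-utility miners the paper uses. All function names and data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(cluster, min_support=2, max_len=2):
    """Brute-force frequent termsets of one cluster (illustrative stand-in
    for a real frequent-pattern miner). Returns {frozenset(terms): support}."""
    counts = Counter()
    for doc in cluster:
        terms = sorted(set(doc))
        for k in range(1, max_len + 1):
            for pattern in combinations(terms, k):
                counts[frozenset(pattern)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def spc_like_score(query, patterns):
    """Hypothetical query-cluster score (NOT the paper's SPC formula):
    sum of the supports of patterns fully contained in the query."""
    q = set(query)
    return sum(support for pattern, support in patterns.items() if pattern <= q)


def rank_clusters(query, pattern_bases):
    """Order cluster ids by decreasing score, breaking ties by id."""
    scores = {cid: spc_like_score(query, pats) for cid, pats in pattern_bases.items()}
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Toy clusters, assumed to come from a prior clustering step
# (k-means, DBSCAN, or spectral clustering in the paper).
clusters = {
    "sports": [
        ["football", "goal", "match"],
        ["football", "match", "league"],
        ["goal", "league", "match"],
    ],
    "finance": [
        ["stock", "market", "bank"],
        ["bank", "loan", "market"],
        ["stock", "market", "trade"],
    ],
}

# Step 2: one pattern base per cluster. Step 3: rank clusters for a query.
pattern_bases = {cid: mine_frequent_patterns(docs) for cid, docs in clusters.items()}
query = ["football", "match", "tickets"]
ranking = rank_clusters(query, pattern_bases)
print(ranking)
```

In this toy run the query shares frequent patterns only with the "sports" cluster, so that cluster is ranked first and retrieval would then search inside it rather than scanning every object.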

    Published In

    Applied Intelligence  Volume 51, Issue 4
    Apr 2021
    874 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 01 April 2021
    Accepted: 01 September 2020

    Author Tags

    1. Information retrieval
    2. Data mining
    3. Cluster-based approaches
    4. Frequent and high-utility pattern mining

    Qualifiers

    • Research-article

    Cited By

    • (2024) A node clustering algorithm for heterogeneous information networks based on node embeddings. Multimedia Tools and Applications 83:2 (3745-3766). DOI: 10.1007/s11042-023-15245-9. Online publication date: 1-Jan-2024
    • (2024) An efficient document information retrieval using hybrid global search optimization algorithm with density based clustering technique. Cluster Computing 27:1 (689-705). DOI: 10.1007/s10586-023-03976-1. Online publication date: 1-Feb-2024
    • (2023) A transformer framework for generating context-aware knowledge graph paths. Applied Intelligence 53:20 (23740-23767). DOI: 10.1007/s10489-023-04588-3. Online publication date: 14-Jul-2023
    • (2023) Unsupervised Clustering and Explainable AI for Unveiling Behavioral Variations Across Time in Home-Appliance Generated Data. Information Integration and Web Intelligence (147-161). DOI: 10.1007/978-3-031-48316-5_17. Online publication date: 4-Dec-2023
    • (2022) A cross-lingual sentence pair interaction feature capture model based on pseudo-corpus and multilingual embedding. AI Communications 35:1 (1-14). DOI: 10.3233/AIC-210085. Online publication date: 1-Jan-2022
    • (2022) Managing and Retrieving Bilingual Documents Using Artificial Intelligence-Based Ontological Framework. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/4636931. Online publication date: 1-Jan-2022
    • (2022) A Belief Two-Level Weighted Clustering Method for Incomplete Pattern Based on Multiview Fusion. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/2895338. Online publication date: 1-Jan-2022
    • (2022) Clustering via multiple kernel k-means coupled graph and enhanced tensor learning. Applied Intelligence 53:3 (2564-2575). DOI: 10.1007/s10489-022-03679-x. Online publication date: 10-May-2022
    • (2022) A clustering algorithm based on density decreased chain for data with arbitrary shapes and densities. Applied Intelligence 53:2 (2098-2109). DOI: 10.1007/s10489-022-03583-4. Online publication date: 5-May-2022
    • (2022) A day at the races. Applied Intelligence 52:5 (5617-5632). DOI: 10.1007/s10489-021-02719-2. Online publication date: 1-Mar-2022
