research-article

Cluster-based information retrieval using pattern mining

Authors: Youcef Djenouri, Djamel Djenouri, Jerry Chun-Wei Lin

Applied Intelligence, Volume 51, Issue 4

Pages 1888 - 1903

https://doi.org/10.1007/s10489-020-01922-x

Published: 01 April 2021

Abstract

This paper addresses the problem of responding to user queries by fetching the most relevant objects from a clustered set of objects. It tackles the common drawbacks of cluster-based approaches and targets fast, high-quality information retrieval. For this purpose, a novel cluster-based information retrieval approach is proposed, named Cluster-based Retrieval using Pattern Mining (CRPM). The approach integrates various clustering and pattern mining algorithms. First, it groups similar objects into clusters. Three clustering algorithms, based on k-means, DBSCAN (density-based spatial clustering of applications with noise), and spectral clustering, are suggested to minimize the number of terms shared among the clusters. Second, frequent and high-utility pattern mining algorithms are applied to each cluster to extract the pattern bases. Third, the clusters of objects are ranked for every query. In this context, two ranking strategies are proposed: i) Score Pattern Computing (SPC), which calculates a score representing the similarity between a user query and a cluster; and ii) Weighted Terms in Clusters (WTC), which calculates a weight for every term and uses the relevant terms to compute the score between a user query and each cluster. Irrelevant information derived from the pattern bases is also used to deal with unexpected user queries. To evaluate the proposed approach, extensive experiments were carried out on two use cases: a document corpus and a tweet corpus. The results showed that the approach outperformed traditional and cluster-based information retrieval approaches in terms of the quality of the returned objects, while remaining very competitive in terms of runtime.
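The second and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the SPC or WTC formulas, so both scoring functions below are assumed forms chosen only to show the idea. The toy clusters, the brute-force termset miner, and every function name are hypothetical, and the clustering step is assumed to have been done already.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_termsets(docs, min_support=0.5, max_size=2):
    """Return {termset: support} for termsets that occur in at least
    min_support of the cluster's documents (brute force, toy scale only)."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            counts.update(frozenset(c) for c in combinations(terms, size))
    n = len(docs)
    return {ts: c / n for ts, c in counts.items() if c / n >= min_support}


def pattern_score(query, patterns):
    """SPC-like score (assumed form): total support of the patterns
    that are fully contained in the query."""
    q = set(query)
    return sum(sup for ts, sup in patterns.items() if ts <= q)


def weighted_term_score(query, patterns):
    """WTC-like score (assumed form): weight each term by the summed support
    of the patterns containing it, then add up the weights of the query terms."""
    weights = Counter()
    for ts, sup in patterns.items():
        for term in ts:
            weights[term] += sup
    return sum(weights[t] for t in set(query))


def rank_clusters(query, pattern_bases, score=pattern_score):
    """Return cluster ids ordered from best to worst match for the query."""
    return sorted(
        pattern_bases,
        key=lambda cid: score(query, pattern_bases[cid]),
        reverse=True,
    )


# Hypothetical pre-clustered toy corpus (step 1 is assumed to be done).
clusters = {
    "sports": [
        ["football", "goal", "match"],
        ["football", "match", "league"],
        ["goal", "match"],
    ],
    "finance": [
        ["stock", "market", "price"],
        ["market", "price", "bank"],
        ["stock", "price"],
    ],
}

# Step 2: one pattern base per cluster.
pattern_bases = {cid: mine_frequent_termsets(docs) for cid, docs in clusters.items()}

# Step 3: rank the clusters for a query with either strategy.
print(rank_clusters(["football", "match"], pattern_bases))
print(rank_clusters(["stock", "price"], pattern_bases, score=weighted_term_score))
```

In this sketch a query is matched against the compact pattern bases rather than against every object, which is one plausible reading of why the approach stays competitive in runtime. A real system would replace the brute-force miner with a dedicated frequent or high-utility pattern mining algorithm.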

Index Terms

Cluster-based information retrieval using pattern mining
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Published In

Applied Intelligence Volume 51, Issue 4

Apr 2021

874 pages

ISSN:0924-669X

© The Author(s) 2020.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 April 2021

Accepted: 01 September 2020

Qualifiers

Research-article

Citations

Cited By

Liu D, Li L (2024) A node clustering algorithm for heterogeneous information networks based on node embeddings. Multimedia Tools and Applications 83(2):3745-3766. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1007/s11042-023-15245-9
Inje B, Nagwanshi K, Rambola R (2024) An efficient document information retrieval using hybrid global search optimization algorithm with density based clustering technique. Cluster Computing 27(1):689-705. Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.1007/s10586-023-03976-1
Lo P, Lim E (2023) A transformer framework for generating context-aware knowledge graph paths. Applied Intelligence 53(20):23740-23767. Online publication date: 14-Jul-2023
https://dl.acm.org/doi/10.1007/s10489-023-04588-3
Tolas R, Portase R, Lemnaru C, Dinsoreanu M, Potolea R (2023) Unsupervised Clustering and Explainable AI for Unveiling Behavioral Variations Across Time in Home-Appliance Generated Data. Information Integration and Web Intelligence, pp 147-161. Online publication date: 4-Dec-2023
https://dl.acm.org/doi/10.1007/978-3-031-48316-5_17
Liu G, Dong Y, Wang K, Yan Z (2022) A cross-lingual sentence pair interaction feature capture model based on pseudo-corpus and multilingual embedding. AI Communications 35(1):1-14. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.3233/AIC-210085
Alothman A, Wahab Sait A (2022) Managing and Retrieving Bilingual Documents Using Artificial Intelligence-Based Ontological Framework. Computational Intelligence and Neuroscience 2022. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/4636931
Ma Z, Zhao H, Li L, Song L (2022) A Belief Two-Level Weighted Clustering Method for Incomplete Pattern Based on Multiview Fusion. Computational Intelligence and Neuroscience 2022. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/2895338
You J, Han C, Ren Z, Li H, You X (2022) Clustering via multiple kernel k-means coupled graph and enhanced tensor learning. Applied Intelligence 53(3):2564-2575. Online publication date: 10-May-2022
https://dl.acm.org/doi/10.1007/s10489-022-03679-x
Li R, Cai Z (2022) A clustering algorithm based on density decreased chain for data with arbitrary shapes and densities. Applied Intelligence 53(2):2098-2109. Online publication date: 5-May-2022
https://dl.acm.org/doi/10.1007/s10489-022-03583-4
Losada D, Elsweiler D, Harvey M, Trattner C (2022) A day at the races. Applied Intelligence 52(5):5617-5632. Online publication date: 1-Mar-2022
https://dl.acm.org/doi/10.1007/s10489-021-02719-2
