Abstract
Information retrieval covers a variety of processes that deliver information to the people who need it. Several mathematical approaches have been studied to formalize the main components of an information retrieval system: query representation, information item representation and the retrieval process. Nevertheless, such systems still have difficulty extracting relevant information for users, especially when the processed data are texts, because of the complex nature of text databases. Generally, an information retrieval system reformulates queries according to associations among information items before matching them to dataset items. In this setting, semantic relationships or machine learning techniques can be applied to refine the returned results. This paper presents a formal model to organize data and a new search algorithm to browse it. The approach incorporates a natural language preprocessing stage, a statistical representation of short documents and queries, and a machine learning model to select relevant results. We then propose two further optimizations that gave satisfying results on two datasets in a reasonable computation time. The first optimization concerns query expansion, while the second concerns dataset restructuring. We evaluate the impact of each optimization by computing the performance of the information retrieval system with and without it; the highest recall and precision reached were 96.2% and 99.2%, respectively.
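As an illustration of the with/without comparison described in the abstract, the following is a minimal sketch of how precision and recall could be computed for a single query. The function name `precision_recall`, the document identifiers and the two result lists are hypothetical and are not taken from the paper or its datasets.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data (assumption): not the paper's datasets or results.
relevant = {"d1", "d2", "d3", "d4"}
baseline = ["d1", "d2", "d9"]          # results without an optimization
optimized = ["d1", "d2", "d3", "d9"]   # results with an optimization

p_base, r_base = precision_recall(baseline, relevant)
p_opt, r_opt = precision_recall(optimized, relevant)

print(f"without: precision={p_base:.3f} recall={r_base:.3f}")
print(f"with:    precision={p_opt:.3f} recall={r_opt:.3f}")
```

Running the same function on the results obtained with and without a given optimization, over the same set of relevant items, gives a direct per-query comparison; averaging over queries would be one possible way to summarize a whole dataset.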
Notes
Yahoo! Webscope dataset ydata-ymusic-user-artist-ratings-v1_0 [http://research.yahoo.com/Academic_Relations].
Cite this article
Khalifi, H., Cherif, W., Qadi, A.E. et al. Query expansion based on clustering and personalized information retrieval. Prog Artif Intell 8, 241–251 (2019). https://doi.org/10.1007/s13748-019-00178-y
DOI: https://doi.org/10.1007/s13748-019-00178-y