Abstract
Semantic-based question retrieval is an important approach to finding similar questions in community question answering (CQA). Its major challenges are polysemy and the lexical gap between questions; in addition, the similar questions returned by a semantic retrieval model may not be of high enough quality to resolve the asker's problem. To address these challenges, a high-quality, multi-level semantic analysis-based similar question retrieval framework named HQML-QR is proposed. It consists of two components: tag-level and sentence-level semantic representation for question retrieval (TS-QR) and multi-dimensional quality analysis (MDQQ). First, TS-QR extracts multi-level semantic features from the question content: a graph embedding model learns the coarse-grained semantics of questions at the tag level, while a pre-trained language model based on the self-attention mechanism identifies polysemy and extracts the fine-grained sentence-level semantics of questions, which supports retrieval accuracy. Second, based on four quality factors in CQA (popularity, question, answer and user), MDQQ constructs a multi-dimensional quality evaluation model that provides a reasonable standard for measuring question quality. The similarity score obtained by semantic vector matching is then updated according to question quality, so that the retrieved questions are both semantically similar and of high quality. Finally, experiments on the CQADupStack dataset from Stack Overflow show that HQML-QR increases P@N by an average of 5.65%, 4.44% and 4.34% compared with LDA-VSM-SEM, WET-QR and RCM-QR, respectively.
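The quality-guided update of the similarity score and the P@N metric can be illustrated with a minimal sketch. The abstract does not give the update rule, so the linear blend below, the weight `alpha`, the assumption that quality scores lie in [0, 1], and the function names `cosine`, `rerank` and `precision_at_n` are all hypothetical choices made for illustration, not the authors' formulation. The vectors stand in for whatever combined tag-level and sentence-level representation TS-QR produces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates, alpha=0.8):
    """Rank candidates by a blend of semantic similarity and quality.

    candidates: list of (question_id, vector, quality) tuples, where
    quality is assumed to be a score in [0, 1].
    alpha: hypothetical weight on semantic similarity (assumption).
    """
    scored = []
    for question_id, vec, quality in candidates:
        similarity = cosine(query_vec, vec)
        # Assumed update rule: linear interpolation of the two scores.
        score = alpha * similarity + (1.0 - alpha) * quality
        scored.append((question_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def precision_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the top-n retrieved questions that are relevant."""
    if n <= 0:
        return 0.0
    hits = sum(1 for qid in ranked_ids[:n] if qid in relevant_ids)
    return hits / n


if __name__ == "__main__":
    query = [1.0, 0.0]
    pool = [
        ("a", [1.0, 0.0], 0.1),  # similar, low quality
        ("b", [1.0, 0.0], 0.9),  # similar, high quality
        ("c", [0.0, 1.0], 1.0),  # dissimilar, high quality
    ]
    ranking = [qid for qid, _ in rerank(query, pool)]
    print(ranking)
    print(precision_at_n(ranking, {"b", "c"}, 2))
```

In this toy pool, two candidates are equally similar to the query, so the quality term decides their order, while the dissimilar candidate stays last despite its high quality. A larger `alpha` would make the ranking depend more heavily on semantic similarity.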
Data availability
The data are openly available in public repositories. The Program dataset that supports the findings of this study is available at https://archive.org/details/stackexchange. The Stack Overflow dataset that supports the findings of this study is available at http://nlp.cis.unimelb.edu.au/resources/cqadupstack/.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 52073169 and 92270124). We appreciate the support of the High Performance Computing Center of Shanghai University, and Shanghai Engineering Research Center of Intelligent Computing System.
Author information
Contributions
Yue Liu was involved in the conceptualization, methodology, validation, formal analysis, writing—original draft, writing—review and editing, and supervision. Weize Tang contributed to the methodology, software, validation, data curation, writing—original draft, and writing—review and editing. Zitu Liu contributed to the software, validation, writing—original draft, writing—review and editing. Aihua Tang assisted in the methodology, formal analysis, and writing—original draft. Lipeng Zhang performed formal analysis and writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical and informed consent
The experimental data used in our research consist entirely of publicly available datasets.
About this article
Cite this article
Liu, Y., Tang, W., Liu, Z. et al. Similar question retrieval with incorporation of multi-dimensional quality analysis for community question answering. Neural Comput & Applic 36, 3663–3679 (2024). https://doi.org/10.1007/s00521-023-09266-6