Domain-specific cross-language relevant question retrieval

Authors:

Zhenchang Xing,

Shanping Li

Empirical Software Engineering, Volume 23, Issue 2

Pages 1084 - 1122

https://doi.org/10.1007/s10664-017-9568-3

Published: 01 April 2018

Abstract

Chinese developers often cannot search effectively for questions in English, because they may have difficulty translating technical terms from Chinese to English and formulating proper English queries. To help Chinese developers take advantage of the rich knowledge base of Stack Overflow and to simplify the question retrieval process, we propose an automated cross-language relevant question retrieval (CLRQR) system that retrieves relevant English questions for a given Chinese question. CLRQR first extracts essential information (both Chinese and English) from the title and description of the input Chinese question, then performs domain-specific translation of the essential Chinese information into English, and finally formulates an English query for retrieving relevant questions from a repository of English Stack Overflow questions. We propose three retrieval algorithms (word-embedding, word-matching, and vector-space-model based methods) that exploit different document representations and similarity metrics. To evaluate our approach and investigate the effectiveness of the retrieval algorithms, we propose four baseline approaches that combine different sources of query words, query formulation mechanisms, and search engines. We randomly select 80 Java, 20 Python, and 20 .NET questions from SegmentFault and V2EX (two Chinese Q&A websites for computer programming) as the Chinese query questions. We conduct a user study to evaluate the relevance of the English questions retrieved by CLRQR with each retrieval algorithm and by the four baseline approaches. The experimental results show that CLRQR with word-embedding based retrieval achieves the best performance.
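To make the final retrieval step concrete, the following is a minimal sketch of one generic vector-space-model ranking: questions and the query are turned into TF-IDF vectors and ranked by cosine similarity. It is not the authors' implementation. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three-question repository are all assumptions made for illustration, and the query is assumed to have already been extracted and translated into English.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the repository."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the repository are ignored."""
    tf = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs, top_k=3):
    """Rank repository questions by cosine similarity to the English query."""
    idf = build_idf(docs)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(doc, idf)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical repository of English questions (illustrative only).
REPOSITORY = [
    "how to convert string to int in java",
    "java hashmap iterate over keys and values",
    "python read file line by line",
]

if __name__ == "__main__":
    # Hypothetical query, assumed to be already translated into English.
    for score, question in retrieve("java convert string int", REPOSITORY):
        print(f"{score:.3f}  {question}")
```

A word-embedding based variant would swap the sparse TF-IDF vectors for dense word vectors, so that related terms with no exact lexical overlap can still contribute to the similarity score.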

Published In

Empirical Software Engineering Volume 23, Issue 2

April 2018

588 pages

ISSN:1382-3256

Copyright © 2018 Springer Science+Business Media, LLC, part of Springer Nature.

Publisher

Kluwer Academic Publishers

United States

Cited By

Sun K, Ren Y, Kuang H, Gao H, Ma X, Rong G, Shao D, Zhang H, Filkov V, Ray B, Zhou M (2024) AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp 519-530. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695023

Gao Z, Xia X, Lo D, Grundy J, Zhang X, Xing Z (2022) I Know What You Are Searching for: Code Snippet Recommendation from Stack Overflow Posts. ACM Transactions on Software Engineering and Methodology 32(3):1-42. Online publication date: 21-Jul-2022.
https://dl.acm.org/doi/10.1145/3550150

Wei M, Huang Y, Wang J, Shin J, Harzevili N, Wang S, Roychoudhury A, Cadar C, Kim M (2022) API recommendation for machine learning libraries: how far are we? Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 370-381. Online publication date: 7-Nov-2022.
https://dl.acm.org/doi/10.1145/3540250.3549124

Li B, Yang P, Zhao H, Zhang P, Liu Z (2022) Hierarchical Sliding Inference Generator for Question-driven Abstractive Answer Summarization. ACM Transactions on Information Systems 41(1):1-27. Online publication date: 14-Feb-2022.
https://dl.acm.org/doi/10.1145/3511891

Lin J, Liu Y, Cleland-Huang J (2021) Information retrieval versus deep learning approaches for generating traceability links in bilingual projects. Empirical Software Engineering 27(1). Online publication date: 22-Oct-2021.
https://dl.acm.org/doi/10.1007/s10664-021-10050-0

Gao Z, Xia X, Grundy J, Lo D, Li Y (2020) Generating Question Titles for Stack Overflow from Mined Code Snippets. ACM Transactions on Software Engineering and Methodology 29(4):1-37. Online publication date: 26-Sep-2020.
https://dl.acm.org/doi/10.1145/3401026

Chen J, Chen C, Xing Z, Xia X, Zhu L, Grundy J, Wang J (2020) Wireframe-based UI Design Search through Image Autoencoder. ACM Transactions on Software Engineering and Methodology 29(3):1-31. Online publication date: 16-Jun-2020.
https://dl.acm.org/doi/10.1145/3391613