Abstract
Motivation: Code search is an important activity in software development since developers are regularly searching [6] for code examples dealing with diverse programming concepts, APIs, and specific platform peculiarities. To help developers search for source code, several Internet-scale code search engines, such as OpenHub [5] and Codota [1], have been proposed. Unfortunately, these Internet-scale code search engines have limited performance since they treat source code as natural language documents. To improve the performance of search engines, both the construction of the search space index and the mapping of queries onto that index must address the challenge that "no single word can be chosen to describe a programming concept in the best way" [2]. This is known in the literature as the vocabulary mismatch problem [3].
Approach: We propose a novel approach to augmenting user queries in a free-form code search scenario. This approach aims at improving the quality of code examples returned by Internet-scale code search engines by building a Code voCaBulary (CoCaBu) [7]. The originality of CoCaBu is that it addresses the vocabulary mismatch problem by expanding, enriching, and re-targeting a user's free-form query, building on similar questions in Q&A sites, so that a code search engine can find highly relevant code in source code repositories. Figure 1 provides an overview of our approach.
The search process begins with a free-form query from a user, i.e., a sentence written in a natural language:
(a) For a given query, CoCaBu first searches for relevant posts in Q&A forums. The role of the Search Proxy is to forward the developer's free-form query to web search engines, which collect and rank the Q&A posts most relevant to the query.
(b) CoCaBu then generates an augmented query based on the information in the relevant posts. It mainly leverages code snippets in the previously identified posts. The Code Query Generator then creates another query which includes not only the initial user query terms but also program elements. To accelerate this step in the search process, CoCaBu builds upfront a snippet index for Q&A posts.
(c) Once the augmented query is constructed, CoCaBu searches source files for code locations that match the query terms. For this step, we crawl a large number of repositories and build upfront a code index of program elements in the source code.
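Steps (a) to (c) can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the CoCaBu implementation: the in-memory `QA_POSTS` and `CODE_FILES` stand in for the snippet index and the code index, word overlap stands in for a web search engine's ranking, and all function names (`search_posts`, `program_elements`, `augment_query`, `search_code`) are hypothetical.

```python
import re

# Hypothetical toy data standing in for the Q&A snippet index and the code index.
QA_POSTS = [
    {"title": "How to read a text file line by line in Java",
     "snippet": "BufferedReader reader = new BufferedReader(new FileReader(path)); reader.readLine();"},
    {"title": "Parse JSON string in Java",
     "snippet": "JSONObject obj = new JSONObject(text); obj.getString(key);"},
]
CODE_FILES = {
    "FileUtil.java": "BufferedReader br = new BufferedReader(new FileReader(f)); String s = br.readLine();",
    "JsonUtil.java": "JSONObject o = new JSONObject(s); return o.getString(name);",
}


def tokenize(text):
    """Lowercase natural-language terms of a free-form query or post title."""
    return re.findall(r"[a-z]+", text.lower())


def search_posts(query, posts, top_k=1):
    """Step (a): rank posts by word overlap with the query (assumed stand-in for web search)."""
    terms = set(tokenize(query))
    ranked = sorted(posts, key=lambda p: len(terms & set(tokenize(p["title"]))), reverse=True)
    return ranked[:top_k]


def program_elements(snippet):
    """Extract capitalized type names and invoked method names from a code snippet."""
    types = re.findall(r"\b[A-Z][A-Za-z0-9]*\b", snippet)
    methods = re.findall(r"\.([a-z][A-Za-z0-9]*)\(", snippet)
    return set(types) | set(methods)


def augment_query(query, posts):
    """Step (b): combine the user's terms with program elements found in relevant posts."""
    elements = set()
    for post in posts:
        elements |= program_elements(post["snippet"])
    return {"terms": tokenize(query), "elements": sorted(elements)}


def search_code(augmented, files):
    """Step (c): rank source files by how many program elements of the query they contain."""
    wanted = set(augmented["elements"])
    scores = {
        name: len(wanted & set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)))
        for name, source in files.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


query = "read file line by line"
augmented = augment_query(query, search_posts(query, QA_POSTS))
ranking = search_code(augmented, CODE_FILES)
```

In this toy run the query contains no API names, yet the augmented query carries `BufferedReader`, `FileReader`, and `readLine` taken from the matching post, which is what lets the code search step rank `FileUtil.java` first.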
Contributions:
• CoCaBu approach to the vocabulary mismatch problem: We propose a technique for finding relevant code with free-form query terms that describe programming tasks, with no a priori knowledge of the API keywords to search for.
• GitSearch free-form search engine for GitHub: We instantiate the CoCaBu approach based on indices of Java files built from GitHub and Q&A posts from Stack Overflow to find the most relevant code examples for developer queries.
• Empirical user evaluation: Comparison with popular code search engines shows that GitSearch is more effective in returning acceptable code search results. In addition, comparison against web search engines indicates that GitSearch is a competitive alternative. Finally, via a live study, we show that users on Q&A sites may find GitSearch's real code examples acceptable as answers to developer questions.
Concluding remarks: As a follow-up work, we have also leveraged Stack Overflow data to build a practical, novel, and efficient code-to-code search engine [4].