Estimating Semantic Relatedness in Source Code

Published: 02 December 2015


Contemporary software engineering tools exploit semantic relations between individual code terms to aid in code analysis and retrieval tasks. Such tools employ word similarity methods, often used in natural language processing (NLP), to analyze the textual content of source code. However, the notion of similarity in source code is different from natural language. Source code often includes unnatural domain-specific terms (e.g., abbreviations and acronyms), and such terms might be related due to their structural relations rather than linguistic aspects. Therefore, applying natural language similarity methods to source code without adjustment can produce low-quality and error-prone results. Motivated by these observations, we systematically investigate the performance of several semantic-relatedness methods in the context of software. Our main objective is to identify the most effective semantic schemes in capturing association relations between source code terms. To provide an unbiased comparison, different methods are compared against human-generated relatedness information using terms from three software systems. Results show that corpus-based methods tend to outperform methods that exploit external sources of semantic knowledge. However, due to inherent code limitations, the performance of such methods is still suboptimal. To address these limitations, we propose Normalized Software Distance (NSD), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system. NSD overcomes data sparsity and lack of context problems often associated with source code, achieving higher levels of resemblance to the human perception of relatedness at the term and the text levels of code.


David A. Gustafson

Analyzing existing source code to identify relationships between classes might be more accurate using the authors' new method, normalized software distance (NSD), which is introduced, derived, and analyzed in this paper. Correctly identifying relationships between sections of code could be an essential starting point for automating tools that improve testing, maintenance, and other aspects of large software systems. Natural language processing (NLP) has been used to discover relationships between sections of code; however, source code is very different from natural language. The authors' proposed metric, NSD, is shown to be superior to general NLP methods.

The authors analyzed a number of currently proposed methods, including latent semantic analysis (LSA) [1], normalized Google distance (NGD) [2], pointwise mutual information (PMI) [3], path-based methods [4,5], information-content methods [6,7,8], and a definition-of-words method [9]. Three software systems, all written in Java, were used for comparison: a student-developed open-source medical application of 47.6 KLOC (thousands of lines of code), a subproject of the Apache Ant project of 40.9 KLOC, and a financial software package contributed by an industrial partner. Each software system had participants with two or more years of experience with the respective software. The above methods were applied to these three software projects, and the results were compared to the knowledge of the participants. The evaluation measures were recall analysis and mean average precision.

The LSA method was the best of the current methods. The analysis showed a number of issues. The methods based on external sources were not as good as those based on the source code. However, the sparsity and lack of uniqueness in source code were an issue for code-based approaches, including perfect term dependence (where two terms only appear together).
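The perfect term dependence issue can be illustrated with a minimal Python sketch of PMI, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch assumes probabilities are estimated as the fraction of classes that contain a term; the function name `pmi_table` and the toy corpus are hypothetical and are not taken from the paper's experiments.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(classes):
    """Return PMI (base 2) for every term pair that co-occurs in some class.

    `classes` is a list of term sets, one set per class. Probabilities are
    estimated as the fraction of classes in which a term or pair occurs
    (an assumption made for this illustration).
    """
    n = len(classes)
    term_count = Counter()
    pair_count = Counter()
    for terms in classes:
        term_count.update(terms)
        # Sorting makes every pair key an ordered tuple (x, y) with x < y.
        pair_count.update(combinations(sorted(terms), 2))
    return {
        (x, y): math.log2(n_xy * n / (term_count[x] * term_count[y]))
        for (x, y), n_xy in pair_count.items()
    }

# Hypothetical toy corpus: each set holds the terms found in one class.
classes = [
    {"patient", "record", "id"},
    {"patient", "record", "visit"},
    {"patient", "record", "bill"},
    {"patient", "id", "visit"},
    {"tmp", "buf", "id"},  # "tmp" and "buf" occur only here, together
    {"bill", "visit", "id"},
]

scores = pmi_table(classes)
top_pair = max(scores, key=scores.get)
print(top_pair, round(scores[top_pair], 3))
print(("patient", "record"), round(scores[("patient", "record")], 3))
```

In this toy corpus the pair that occurs exactly once, and only together, receives the highest PMI score, above a pair that co-occurs in three classes. This is the kind of sparsity effect the review refers to.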
Their approach, NSD, is a hybrid technique that uses the class as the level of granularity for the code analysis. The authors provide a theoretical derivation of their approach. The resulting NSD method involves normalizing the maximums of the logarithms of the inverses of the Bayesian probabilities of one term given the occurrence of the other. The measure shows statistically significant improvement over the other methods on these three software projects. NSD also had the lowest time complexity in both preprocessing and relatedness calculations.

The paper includes an extensive set of references and a thorough explanation of the experiments, results, derivation of the NSD measure, NSD results, limitations, and related work. This paper would be a good technical introduction to the area of semantic relatedness in source code.

Online Computing Reviews Service
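The review's verbal description of NSD can be turned into a rough Python sketch. The review does not reproduce the formula, so the form below is an assumption modeled on the normalized information distance: max(log 1/P(x|y), log 1/P(y|x)) divided by max(log 1/P(x), log 1/P(y)), with probabilities estimated from class-level term occurrence. With these estimates the expression is algebraically the same shape as NGD; the exact NSD definition in the paper may differ. The function name `nsd_like_distance` and the toy corpus are hypothetical.

```python
import math

def nsd_like_distance(x, y, classes):
    """NSD-style distance between terms x and y (illustrative assumption).

    `classes` is a list of term sets, one per class, so the class is the
    unit of co-occurrence. Lower values mean more closely related terms.
    """
    n = len(classes)
    f_x = sum(1 for terms in classes if x in terms)
    f_y = sum(1 for terms in classes if y in terms)
    f_xy = sum(1 for terms in classes if x in terms and y in terms)
    if f_x == 0 or f_y == 0:
        raise ValueError("both terms must occur in at least one class")
    if f_xy == 0:
        return math.inf  # the terms never share a class
    # log(1 / P(x|y)) with P(x|y) = f_xy / f_y, and symmetrically for P(y|x)
    numerator = max(math.log(f_y / f_xy), math.log(f_x / f_xy))
    # log(1 / P(x)) with P(x) = f_x / n, and likewise for y
    denominator = max(math.log(n / f_x), math.log(n / f_y))
    if denominator == 0.0:
        return 0.0  # both terms occur in every class
    return numerator / denominator

# Hypothetical toy corpus: each set holds the terms found in one class.
classes = [
    {"patient", "record", "id"},
    {"patient", "record", "visit"},
    {"patient", "record", "bill"},
    {"patient", "id", "visit"},
    {"tmp", "buf", "id"},
    {"bill", "visit", "id"},
]

for pair in [("patient", "record"), ("bill", "visit"), ("patient", "tmp")]:
    print(pair, nsd_like_distance(pair[0], pair[1], classes))
```

Each call here rescans the corpus for clarity. A practical version would count term and pair frequencies once up front, but the review gives no implementation details, so that is left open.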

Published In

ACM Transactions on Software Engineering and Methodology  Volume 25, Issue 1
December 2015
339 pages


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2015
Accepted: 01 September 2015
Revised: 01 May 2015
Received: 01 November 2014
Published in TOSEM Volume 25, Issue 1


Author Tags

  1. Semantic relatedness
  2. clustering
  3. information retrieval
  4. information theory
  5. latent semantics


Funding Sources

  • Louisiana Board of Regents Research Competitiveness Subprogram (LABoR-RCS)


