Estimating Semantic Relatedness in Source Code

Published: 02 December 2015

Abstract

Contemporary software engineering tools exploit semantic relations between individual code terms to aid in code analysis and retrieval tasks. Such tools employ word similarity methods, often used in natural language processing (NLP), to analyze the textual content of source code. However, the notion of similarity in source code is different from natural language. Source code often includes unnatural domain-specific terms (e.g., abbreviations and acronyms), and such terms might be related due to their structural relations rather than linguistic aspects. Therefore, applying natural language similarity methods to source code without adjustment can produce low-quality and error-prone results. Motivated by these observations, we systematically investigate the performance of several semantic-relatedness methods in the context of software. Our main objective is to identify the most effective semantic schemes in capturing association relations between source code terms. To provide an unbiased comparison, different methods are compared against human-generated relatedness information using terms from three software systems. Results show that corpus-based methods tend to outperform methods that exploit external sources of semantic knowledge. However, due to inherent code limitations, the performance of such methods is still suboptimal. To address these limitations, we propose Normalized Software Distance (NSD), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system. NSD overcomes data sparsity and lack of context problems often associated with source code, achieving higher levels of resemblance to the human perception of relatedness at the term and the text levels of code.

Reviews

David A. Gustafson

Analyzing existing source code to identify relationships between classes might be more accurate using the authors' new method, normalized software distance (NSD), which is introduced, derived, and analyzed in this paper. Correctly identifying relationships between sections of code could be an essential starting point for automating tools that support testing, maintenance, and other improvements in large software systems. Natural language processing (NLP) has been used to discover relationships between sections of code; however, source code is very different from natural language. The proposed metric, NSD, is shown to be superior to general NLP methods.

The authors analyzed a number of currently proposed methods, including latent semantic analysis (LSA) [1], normalized Google distance (NGD) [2], pointwise mutual information (PMI) [3], path-based methods [4,5], information-content methods [6,7,8], and a definition-of-words method [9]. Three software systems, all written in Java, were used for comparison. The first was a student-developed open-source medical application of 47.6 KLOC (thousands of lines of code). The second was a subproject of the Apache Ant project, at 40.9 KLOC. The third was a financial software package contributed by an industrial partner. Each software system had participants with two or more years of experience with the respective software.

The above methods were applied to these three software projects, and the results were compared to the knowledge of the participants. The evaluation measures were recall analysis and mean average precision. The LSA method was the best of the current methods. The analysis revealed several issues. The methods based on external sources were not as good as those based on the source code. However, sparsity and lack of uniqueness in source code were an issue for code-based approaches, including perfect term dependence (where two terms appear only together).
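The two evaluation measures named above can be illustrated with a short sketch. This is not the authors' evaluation code: the function names and the example terms are hypothetical, and it assumes each method returns a ranked list of candidate terms that is scored against the set of terms the human participants judged related.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant terms."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
            # Precision measured at the rank of each relevant hit.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant terms found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the average precision over several (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings produced by some relatedness method.
run_a = (["patient", "visit", "ant", "record"], {"patient", "record"})
run_b = (["task", "build", "target"], {"build"})

print(average_precision(*run_a))               # 0.75
print(recall_at_k(run_a[0], run_a[1], k=2))    # 0.5
print(mean_average_precision([run_a, run_b]))  # 0.625
```

In the first run the relevant terms sit at ranks 1 and 4, so the average precision is (1/1 + 2/4) / 2 = 0.75. A method that ranks human-judged related terms higher scores closer to 1.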
Their approach, NSD, is a hybrid technique that uses the class as the level of granularity for the code analysis. The authors provide a theoretical derivation of their approach. The resulting NSD method involves normalizing the maximums of logs of inverses of Bayesian probabilities of one term given the occurrence of the other. The resulting measure shows statistically significant improvement over the other methods on these three software projects. NSD also had the lowest time complexity in both pre-processing and relatedness calculations. The paper includes an extensive set of references and a thorough explanation of the experiments, results, derivation of the NSD measure, NSD results, limitations, and related work. This paper would be a good technical introduction to the area of semantic relatedness in source code.

Online Computing Reviews Service
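One way to read the reviewer's description is that the probability of one term given the other is estimated from class-level counts, and the larger of the two log inverse probabilities is then normalized. This page does not state NSD's normalization, so the sketch below borrows the denominator of the normalized Google distance (Cilibrasi and Vitanyi), treating each class as a document. It is an NGD-style stand-in under these assumptions, not the paper's NSD formula; all function names and the toy data are hypothetical.

```python
import math
from typing import Dict, Set


def class_index(classes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each term to the set of class names whose text contains it."""
    index: Dict[str, Set[str]] = {}
    for cls, terms in classes.items():
        for term in terms:
            index.setdefault(term, set()).add(cls)
    return index


def max_log_inverse_conditional(index: Dict[str, Set[str]], x: str, y: str) -> float:
    """max(log 1/P(y|x), log 1/P(x|y)), estimated from class counts."""
    cx, cy = index.get(x, set()), index.get(y, set())
    fxy = len(cx & cy)
    if fxy == 0:
        return math.inf  # the terms never share a class
    # P(y|x) = f(x,y) / f(x), hence log(1 / P(y|x)) = log(f(x) / f(x,y)).
    return max(math.log(len(cx) / fxy), math.log(len(cy) / fxy))


def ngd_style_distance(index: Dict[str, Set[str]], x: str, y: str, n_classes: int) -> float:
    """NGD with classes as documents (assumed normalization, not NSD itself)."""
    numerator = max_log_inverse_conditional(index, x, y)
    if math.isinf(numerator):
        return math.inf
    fx, fy = len(index[x]), len(index[y])
    denominator = math.log(n_classes) - min(math.log(fx), math.log(fy))
    if denominator == 0:
        return 0.0  # both terms occur in every class, so the numerator is 0 too
    return numerator / denominator


# Hypothetical system: class name -> terms extracted from that class.
classes = {
    "PatientDAO": {"patient", "dao", "record"},
    "PatientView": {"patient", "view"},
    "RecordDAO": {"record", "dao"},
    "Logger": {"log"},
}
index = class_index(classes)
for pair in [("dao", "record"), ("patient", "dao"), ("patient", "log")]:
    print(pair, ngd_style_distance(index, *pair, n_classes=len(classes)))
```

In this toy data `dao` and `record` always occur in the same classes, so their distance is 0. That is an instance of the perfect term dependence the review lists as an issue for code-based approaches. Terms that never share a class get an infinite distance.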

Information

Published In

ACM Transactions on Software Engineering and Methodology, Volume 25, Issue 1
December 2015
339 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/2852270
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2015
Accepted: 01 September 2015
Revised: 01 May 2015
Received: 01 November 2014
Published in TOSEM Volume 25, Issue 1

Author Tags

  1. Semantic relatedness
  2. clustering
  3. information retrieval
  4. information theory
  5. latent semantics

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Louisiana Board of Regents Research Competitiveness Subprogram (LABoR-RCS)

Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 0

Reflects downloads up to 04 Oct 2024.

Cited By

  • (2023) Safe Maintenance of Railways using COTS Mobile Devices: The Remote Worker Dashboard. ACM Transactions on Cyber-Physical Systems 7:4 (1-20). DOI: 10.1145/3607193. Online publication date: 14-Oct-2023
  • (2023) A Systematic Review of Automated Query Reformulations in Source Code Search. ACM Transactions on Software Engineering and Methodology 32:6 (1-79). DOI: 10.1145/3607179. Online publication date: 4-Jul-2023
  • (2023) Formal Specification, Verification and Repair of Contiki’s Scheduler. ACM Transactions on Cyber-Physical Systems 7:4 (1-28). DOI: 10.1145/3605948. Online publication date: 14-Oct-2023
  • (2023) Applications of natural language processing in software traceability. Journal of Systems and Software 198:C. DOI: 10.1016/j.jss.2023.111616. Online publication date: 1-Apr-2023
  • (2023) Spectrum-based feature localization for families of systems. Journal of Systems and Software 195:C. DOI: 10.1016/j.jss.2022.111532. Online publication date: 1-Jan-2023
  • (2022) The Effect of Feature Characteristics on the Performance of Feature Location Techniques. IEEE Transactions on Software Engineering 48:6 (2066-2085). DOI: 10.1109/TSE.2021.3049735. Online publication date: 1-Jun-2022
  • (2022) DPWord2Vec: Better Representation of Design Patterns in Semantics. IEEE Transactions on Software Engineering 48:4 (1228-1248). DOI: 10.1109/TSE.2020.3017336. Online publication date: 1-Apr-2022
  • (2022) Exploring the Feasibility of Transformer Based Models on Question Relatedness. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (831-838). DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00136. Online publication date: Dec-2022
  • (2022) FineCodeAnalyzer: Multi-Perspective Source Code Analysis Support for Software Developer Through Fine-Granular Level Interactive Code Visualization. IEEE Access 10 (20496-20513). DOI: 10.1109/ACCESS.2022.3151395. Online publication date: 2022
  • (2022) Synergies Between Artificial Intelligence and Software Engineering: Evolution and Trends. Handbook on Artificial Intelligence-Empowered Applied Software Engineering (11-36). DOI: 10.1007/978-3-031-08202-3_2. Online publication date: 4-Sep-2022
