research-article

Estimating Semantic Relatedness in Source Code

Authors:

Anas Mahmoud,

Gary Bradshaw

ACM Transactions on Software Engineering and Methodology (TOSEM), Volume 25, Issue 1

Article No.: 10, Pages 1 - 35

https://doi.org/10.1145/2824251

Published: 02 December 2015

Abstract

Contemporary software engineering tools exploit semantic relations between individual code terms to aid in code analysis and retrieval tasks. Such tools employ word similarity methods, often used in natural language processing (NLP), to analyze the textual content of source code. However, the notion of similarity in source code is different from natural language. Source code often includes unnatural domain-specific terms (e.g., abbreviations and acronyms), and such terms might be related due to their structural relations rather than linguistic aspects. Therefore, applying natural language similarity methods to source code without adjustment can produce low-quality and error-prone results. Motivated by these observations, we systematically investigate the performance of several semantic-relatedness methods in the context of software. Our main objective is to identify the most effective semantic schemes in capturing association relations between source code terms. To provide an unbiased comparison, different methods are compared against human-generated relatedness information using terms from three software systems. Results show that corpus-based methods tend to outperform methods that exploit external sources of semantic knowledge. However, due to inherent code limitations, the performance of such methods is still suboptimal. To address these limitations, we propose Normalized Software Distance (NSD), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system. NSD overcomes data sparsity and lack of context problems often associated with source code, achieving higher levels of resemblance to the human perception of relatedness at the term and the text levels of code.

Index Terms

Estimating Semantic Relatedness in Source Code
1. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Software implementation planning
        Software design techniques
    2. Software development process management

Reviews

Reviewer: David A. Gustafson

Analyzing existing source code to identify relationships between classes might be more accurate using the authors' new method, normalized software distance, which is introduced, derived, and analyzed in this paper. Correctly identifying relationships between sections of code could be an essential starting point in automating tools for improving testing, maintenance, and other improvements in large software systems. Natural language processing (NLP) has been used to discover relationships between sections of code; however, source code is very different from natural language. Their proposed metric, normalized software distance (NSD), is shown to be superior to general NLP methods.

The authors analyzed a number of currently proposed methods, including latent semantic analysis (LSA) [1], normalized Google distance (NGD) [2], pointwise mutual information (PMI) [3], path-based methods [4,5], information-content methods [6,7,8], and a definition-of-words method [9].

Three software systems were used for comparison. All were in Java. One was a student-developed open-source medical application. It was 47.6 KLOC (thousands of lines of code). The second was a subproject of the Apache Ant project. It was 40.9 KLOC. The third was a financial software package contributed by an industrial partner. Each software system had participants with two or more years of experience with the respective software.

The above methods were applied to these three software projects and the results compared to the knowledge of the participants. The evaluation measures were recall analysis and mean average precision. The LSA method was the best of the current methods. The analysis showed a number of issues. The methods based on external sources were not as good as those based on the source code. However, the sparsity and lack of uniqueness in source code were an issue for code-based approaches, including the perfect term dependence (where two terms only appear together).

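
The mean average precision measure named in the review can be illustrated with a minimal sketch. It follows the standard information retrieval definition; how the paper builds its ranked lists and its human-generated relevance sets is not stated here, so the function names and the input shapes below are assumptions made for illustration only.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item appears.

    Relevant items that never appear in the ranking contribute zero,
    because the sum is divided by the total number of relevant items.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    queries: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Average of average_precision over (ranked list, relevant set) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

In this hypothetical setup, each query would be a code term, the ranked list would hold the terms a relatedness method scores as closest to it, and the relevant set would hold the terms that participants judged to be related.
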
Their approach, NSD, is a hybrid technique that uses the class as the level of granularity for the code analysis. The authors provide a theoretical derivation of their approach. The resulting NSD method involves normalizing the maximums of logs of inverses of Bayesian probabilities of one term given the occurrence of the other. NSD shows statistically significant improvement over the other methods on these three software projects. It also had the lowest time complexity in both preprocessing and relatedness calculations.

The paper includes an extensive set of references and a thorough explanation of the experiments, results, derivation of the NSD measure, NSD results, limitations, and related work. This paper would be a good technical introduction to the area of semantic relatedness in source code.

Online Computing Reviews Service
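
The reviewer's description of NSD can be made concrete with a rough sketch. The code below applies the normalized Google distance formula of Cilibrasi and Vitanyi to class-level occurrence counts, treating each class as one document. That substitution is an assumption based on the review's wording and is not confirmed here as the authors' exact formulation. The function name `normalized_distance` and the `classes` mapping are hypothetical.

```python
import math
from typing import Dict, Set


def normalized_distance(term_a: str, term_b: str,
                        classes: Dict[str, Set[str]]) -> float:
    """NGD-style distance computed over classes instead of web pages.

    classes maps a class name to the set of terms found in that class.
    Smaller values mean the two terms are more closely related.
    """
    total = len(classes)
    with_a = {name for name, terms in classes.items() if term_a in terms}
    with_b = {name for name, terms in classes.items() if term_b in terms}
    both = with_a & with_b
    if not both:
        # Covers a missing term and terms that never share a class.
        return math.inf

    log_a = math.log(len(with_a))
    log_b = math.log(len(with_b))
    # max(log 1/P(a|b), log 1/P(b|a)) reduces to this numerator.
    numerator = max(log_a, log_b) - math.log(len(both))
    denominator = math.log(total) - min(log_a, log_b)
    if denominator == 0:
        # Both terms occur in every class, so the numerator is also zero.
        return 0.0
    return numerator / denominator
```

For example, with four classes in which `visit` and `date` always appear together, the distance between them is 0, while two terms that never share a class are assigned an infinite distance.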

Published In

ACM Transactions on Software Engineering and Methodology Volume 25, Issue 1

December 2015

339 pages

ISSN:1049-331X

EISSN:1557-7392

DOI:10.1145/2852270

Editor:
David S. Rosenblum
National University of Singapore, Singapore

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2015

Accepted: 01 September 2015

Revised: 01 May 2015

Received: 01 November 2014

Published in TOSEM Volume 25, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Louisiana Board of Regents Research Competitiveness Subprogram (LABoR-RCS)

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 17
Total Downloads: 804
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 0

Reflects downloads up to 04 Oct 2024
