survey

Opportunities and Challenges in Code Search Tools

Authors:

John Grundy

ACM Computing Surveys (CSUR), Volume 54, Issue 9

Article No.: 196, Pages 1 - 40

https://doi.org/10.1145/3480027

Published: 08 October 2021

Abstract

Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve the expected code from a large-scale codebase. However, a comprehensive comparative summary of existing code search approaches is lacking. To understand the research trends in existing code search studies, we systematically reviewed 81 relevant studies. We investigated the publication trends of code search studies; analyzed the key components used to build code search tools, such as the codebase, the query, and the modeling technique; and classified existing tools according to the seven search tasks they support. Based on our findings, we identified a set of outstanding challenges in existing studies and propose a roadmap for future code search research.

References

[1]

Hervé Abdi. 2007. The Kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA (2007), 508–510.

[2]

Afsoon Afzal, Manish Motwani, Kathryn Stolee, Yuriy Brun, and Claire Le Goues. 2019. SOSRepair: Expressive semantic search for real-world program repair. IEEE Transactions on Software Engineering (2019).

[3]

Parag Agrawal, Arvind Arasu, and Raghav Kaushik. 2010. On indexing error-tolerant set containment. In International Conference on Management of Data. 927–938.

Digital Library

[4]

Shayan Akbar and Avinash Kak. 2019. SCOR: Source code retrieval with semantics and order. In Working Conference on Mining Software Repositories. IEEE, 1–12.

Digital Library

[5]

Miltos Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In International Conference on Machine Learning. 2123–2132.

Digital Library

[6]

Sushil Bajracharya and Cristina Lopes. 2009. Mining search topics from a code search engine usage log. In Working Conference on Mining Software Repositories. IEEE, 111–120.

Digital Library

[7]

Sushil Krishna Bajracharya and Cristina Videira Lopes. 2012. Analyzing and mining a code search engine usage log. Empirical Software Engineering 17, 4–5 (2012), 424–466.

Digital Library

[8]

Sushil K. Bajracharya, Joel Ossher, and Cristina V. Lopes. 2010. Leveraging usage similarity for effective retrieval of examples in code repositories. In International Symposium on Foundations of Software Engineering. 157–166.

[9]

Vipin Balachandran. 2015. Query by example in large-scale code repositories. In IEEE International Conference on Software Maintenance and Evolution. IEEE, 467–476.

Digital Library

[10]

Lingfeng Bao, Zhenchang Xing, Xin Xia, David Lo, Minghui Wu, and Xiaohu Yang. 2020. psc2code: Denoising code extraction from programming screencasts. ACM Transactions on Software Engineering and Methodology 29, 3 (2020), 1–38.

Digital Library

[11]

Anton Barua, Stephen W. Thomas, and Ahmed E. Hassan. 2014. What are developers talking about? An analysis of topics and trends in stack overflow. Empirical Software Engineering 19, 3 (2014), 619–654.

Digital Library

[12]

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In International Conference on Management of Data. 322–331.

Digital Library

[13]

Farnaz Behrang, Steven P. Reiss, and Alessandro Orso. 2018. GUIfetch: Supporting app design and development through GUI search. In International Conference on Mobile Software Engineering and Systems. 236–246.

Digital Library

[14]

Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In EICS. 1–6.

[15]

Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing. Springer, 1–4.

Digital Library

[16]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5 (2017), 135–146.

[17]

Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: Integrating web search into the development environment. In SIGCHI Conference on Human Factors in Computing Systems. ACM, 513–522.

[18]

Joel Brandt, Philip J . Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: Interleaving web foraging, learning, and writing code. In SIGCHI Conference on Human Factors in Computing Systems. ACM, 1589–1598.

[19]

Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. In International Symposium on Foundations of Software Engineering. 964–974.

Digital Library

[20]

Long Chen, Wei Ye, and Shikun Zhang. 2019. Capturing source code semantics via tree-based convolution over API-enhanced AST. In ACM International Conference on Computing Frontiers. 174–182.

Digital Library

[21]

Zhengzhao Chen, Renhe Jiang, Zejun Zhang, Yu Pei, Minxue Pan, Tian Zhang, and Xuandong Li. 2020. Enhancing example-based code search with functional semantics. Journal of Systems and Software (2020), 110568.

[22]

Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR Conference on Research and Development in Information Retrieval. 659–666.

Digital Library

[23]

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.

[24]

David Roxbee Cox and Alan Stuart. 1955. Some quick sign tests for trend in location and dispersion. Biometrika 42, 1/2 (1955), 80–95.

[25]

Kostadin Damevski, David Shepherd, and Lori Pollock. 2014. A case study of paired interleaving for evaluating code search techniques. In Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering. IEEE, 54–63.

[26]

Kostadin Damevski, David Shepherd, and Lori Pollock. 2016. A field study of how developers locate features in source code. Empirical Software Engineering 21, 2 (2016), 724–747.

Digital Library

[27]

Yaniv David and Eran Yahav. 2014. Tracelet-based code search in executables. ACM SIGPLAN Notices 49, 6 (2014), 349–360.

Digital Library

[28]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.

Digital Library

[29]

Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. 2016. Kam1n0: Mapreduce-based assembly clone search for reverse engineering. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 461–470.

Digital Library

[30]

Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. 2019. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In IEEE Symposium on Security and Privacy. IEEE, 472–489.

[31]

Bogdan Dit, Meghan Revelle, Malcom Gethers, and Denys Poshyvanyk. 2013. Feature location in source code: A taxonomy and survey. Journal of Software: Evolution and Process 25, 1 (2013), 53–95.

[32]

Ian Drosos, Titus Barik, Philip J. Guo, Robert DeLine, and Sumit Gulwani. 2020. Wrex: A unified programming-by-example interaction for synthesizing readable code for data scientists. In CHI Conference on Human Factors in Computing Systems. 1–12.

Digital Library

[33]

Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In SIGSAC Conference on Computer and Communications Security. 480–491.

Digital Library

[34]

Brendan J. Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. Science 315, 5814 (2007), 972–976.

[35]

Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Joint Meeting on Foundations of Software Engineering. 49–60.

Digital Library

[36]

Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun. 2018. Vulseeker: A semantic learning based vulnerability seeker for cross-platform binary. In International Conference on Automated Software Engineering. IEEE, 896–899.

Digital Library

[37]

Xi Ge, David Shepherd, Kostadin Damevski, and Emerson Murphy-Hill. 2014. How developers use multi-recommendation system in local code search. In VL/HCC. IEEE, 69–76.

[38]

Mohammad Gharehyazie, Baishakhi Ray, and Vladimir Filkov. 2017. Some from here, some from there: Cross-project code reuse in github. In International Conference on Mining Software Repositories. IEEE, 291–301.

Digital Library

[39]

Aristides Gionis, Piotr Indyk, and Rajeev Motwanil. 1999. Similarity search in high dimensions via hashing. In VLDB, Vol. 99. 518–529.

Digital Library

[40]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In International Conference on Software Engineering. IEEE, 933–944.

Digital Library

[41]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2019. CodeKernel: A graph kernel based approach to the selection of API usage examples. In International Conference on Automated Software Engineering. IEEE, 590–601.

Digital Library

[42]

Sonia Haiduc, Gabriele Bavota, Andrian Marcus, Rocco Oliveto, Andrea De Lucia, and Tim Menzies. 2013. Automatic query reformulations for text retrieval in software engineering. In International Conference on Software Engineering. IEEE, 842–851.

[43]

Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.

Digital Library

[44]

Simon Harris. 2003. Simian-similarity analyser. HYPERLINK Available from http://www. harukizaemon. com/simian/index. html (2003).

[45]

Gang Hu, Min Peng, Yihan Zhang, Qianqian Xie, Wang Gao, and Mengting Yuan. 2020. Unsupervised software repositories mining and its application to code search. Software-Practice & Experience 50, 3 (2020), 299–322.

[46]

Qing Huang, An Qiu, Maosheng Zhong, and Yuan Wang. 2020. A code-description representation learning model based on attention. In International Conference on Software Analysis, Evolution and Reengineering. IEEE, 447–455.

[47]

Qing Huang and Guoqing Wu. 2019. Enhance code search via reformulating queries with evolving contexts. Automated Software Engineering 26, 4 (2019), 705–732.

[48]

Qing Huang and Huaiguang Wu. 2019. QE-integrating framework based on Github knowledge and SVM ranking. Science China. Information Science 62, 5 (2019), 52102.

[49]

Qing Huang, Yang Yang, and Ming Cheng. 2019. Deep learning the semantics of change sequences for query expansion. Software-Practice & Experience 49, 11 (2019), 1600–1617.

[50]

Qing Huang, Yangrui Yang, Xue Zhan, Hongyan Wan, and Guoqing Wu. 2018. Query expansion based on statistical learning from code changes. Software-Practice & Experience 48, 7 (2018), 1333–1351.

[51]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In AMACL. 2073–2083.

[52]

He Jiang, Liming Nie, Zeyi Sun, Zhilei Ren, Weiqiang Kong, Tao Zhang, and Xiapu Luo. 2016. ROSF: Leveraging information retrieval and supervised learning for recommending code snippets. IEEE Transactions on Services Computing 12, 1 (2016), 34–46.

[53]

Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In International Conference on Software Engineering. IEEE, 96–105.

Digital Library

[54]

Renhe Jiang, Zhengzhao Chen, Zejun Zhang, Yu Pei, Minxue Pan, and Tian Zhang. 2018. Semantics-based code search using input/output examples. In International Working Conference on Source Code Analysis and Manipulation. IEEE, 92–102.

[55]

Huan Jin and Lei Xiong. 2019. A query expansion method based on evolving source code. Wuhan University Journal of Natural Sciences 24, 5 (2019), 391–399.

[56]

Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. 2002. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654–670.

Digital Library

[57]

Barbara Kitchenham and Stuart Charters. 2007. Guidelines for Performing Systematic literature Reviews in Software Engineering (Version 2.3). Technical Report, Keele University and University of Durham.

[58]

Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In International Conference on Software Engineering. ACM, 664–675.

Digital Library

[59]

Wei Ming Khoo, Alan Mycroft, and Ross Anderson. 2013. Rendezvous: A search engine for binary code. In Working Conference on Mining Software Repositories. IEEE, 329–338.

[60]

Kisub Kim, Dongsun Kim, Tegawendé F. Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY: a code-to-codesearch engine. In International Conference on Software Engineering. 946–957.

Digital Library

[61]

Kisub Kim, Dongsun Kim, Tegawendé F Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY: A code-to-code search engine. In International Conference on Software Engineering. 946–957.

Digital Library

[62]

Jacob Krüger, Thorsten Berger, and Thomas Leich. 2019. Features and how to find them: A survey of manual feature location. Software Engineering for Variability Intensive Systems (2019), 153–172.

[63]

Brian Kulis and Kristen Grauman. 2009. Kernelized locality-sensitive hashing for scalable image search. In 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2130–2137.

[64]

An Ngoc Lam, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2017. Bug localization with combination of deep learning and information retrieval. In International Conference on Program Comprehension. IEEE, 218–229.

[65]

Moreno Laura, Bavota Gabriele, Di Penta Massimiliano, Oliveto Rocco, and Marcus Andrian. 2015. How can I use this method?. In International Conference on Software Engineering. ACM.

[66]

Mu-Woong Lee, Jong-Won Roh, Seung-won Hwang, and Sunghun Kim. 2010. Instant code clone search. In International Symposium on Foundations of Software Engineering. 167–176.

Digital Library

[67]

Shin-Jie Lee, Xavier Lin, Wu-Chen Su, and Hsi-Min Chen. 2018. A comment-driven approach to API usage patterns discovery and search. Journal of Internet Technology 19, 5 (2018), 1587–1601.

[68]

Otávio A. L. Lemos, Adriano C. de Paula, Felipe C. Zanichelli, and Cristina V. Lopes. 2014. Thesaurus-based automatic query expansion for interface-driven code search. In Working Conference on Mining Software Repositories. 212–221.

[69]

Otávio Augusto Lazzarini Lemos, Sushil Bajracharya, Joel Ossher, Paulo Cesar Masiero, and Cristina Lopes. 2011. A test-driven approach to code search and its application to the reuse of auxiliary functionality. Information and Software Technology 53, 4 (2011), 294–306.

Digital Library

[70]

Otávio Augusto Lazzarini Lemos, Adriano Carvalho de Paula, Gustavo Konishi, Sushil Krishna Bajracharya, Joel Ossher, and Cristina Videira Lopes. 2014. Thesaurus-based tag clouds for test-driven code search.Journal of Universal Computer Science 20, 5 (2014), 772–796.

[71]

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, 10. 707–710.

[72]

Hongwei Li, Zhenchang Xing, Xin Peng, and Wenyun Zhao. 2013. What help do developers seek, when and how?. In Working Conference on Reverse Engineering. IEEE, 142–151.

[73]

Wei Li, Shuhan Yan, Beijun Shen, and Yuting Chen. 2019. Reinforcement learning of code search sessions. In Asia-Pacific Software Engineering Conference. IEEE, 458–465.

[74]

Xuan Li, Zerui Wang, Qianxiang Wang, Shoumeng Yan, Tao Xie, and Hong Mei. 2016. Relationship-aware code search for JavaScript frameworks. In International Symposium on Foundations of Software Engineering. ACM, 690–701.

Digital Library

[75]

Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, and Pierre Baldi. 2009. Sourcerer: Mining and searching Internet-scale software repositories. Data Mining and Knowledge Discovery 18, 2 (2009), 300–336.

Digital Library

[76]

Chao Liu, Cuiyun Gao, Xin Xia, David Lo, John Grundy, and Xiaohu Yang. 2020. On the replicability and reproducibility of deep learning in software engineering. arXiv preprint arXiv:2006.14244 (2020).

[77]

Chao Liu, Xin Xia, David Lo, Zhiwei Liu, Ahmed E. Hassan, and Shanping Li. 2020. Simplifying deep-learning-based model for code search. arXiv preprint arXiv:2005.14373 (2020).

[78]

Chao Liu, Dan Yang, Xin Xia, Meng Yan, and Xiaohong Zhang. 2018. Cross-project change-proneness prediction. In Annual Computer Software and Applications Conference, Vol. 1. IEEE, 64–73.

[79]

Chao Liu, Dan Yang, Xin Xia, Meng Yan, and Xiaohong Zhang. 2019. A two-phase transfer learning model for cross-project defect prediction. Information and Software Technology 107 (2019), 125–136.

[80]

Chao Liu, Dan Yang, Xiaohong Zhang, Haibo Hu, Jed Barson, and Baishakhi Ray. 2018. A recommender system for developer onboarding. In International Conference on Software Engineering: Companion. 319–320.

[81]

Chao Liu, Dan Yang, Xiaohong Zhang, Baishakhi Ray, and Md. Masudur Rahman. 2018. Recommending GitHub projects for developer onboarding. IEEE Access 6 (2018), 52082–52094.

[82]

Jason Liu, Seohyun Kim, Vijayaraghavan Murali, Swarat Chaudhuri, and Satish Chandra. 2019. Neural query expansion for code search. In International Workshop on Machine Learning and Programming Languages. 29–37.

Digital Library

[83]

Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. TBar: Revisiting template-based automated program repair. In ACM SIGSOFT International Symposium on Software Testing and Analysis. 31–42.

Digital Library

[84]

Wenjian Liu, Xin Peng, Zhenchang Xing, Junyi Li, Bing Xie, and Wenyun Zhao. 2018. Supporting exploratory code search with differencing and visualization. In International Conference on Software Analysis, Evolution and Reengineering. IEEE, 300–310.

[85]

Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code recommendation via structural code search. OOPSLA 3 (2019), 152.

[86]

Fei Lv, Hongyu Zhang, Jian-guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. Codehow: Effective code search based on API understanding and extended Boolean model (e). In International Conference on Automated Software Engineering. IEEE, 260–270.

Digital Library

[87]

Lee Wei Mar, Ye-Chi Wu, and Hewijin Christine Jiau. 2011. Recommending proper API code examples for documentation purpose. In Asia-Pacific Software Engineering Conference. IEEE, 331–338.

Digital Library

[88]

Lee Martie, Thomas D. LaToza, and Andre van der Hoek. 2015. Codeexchange: Supporting reformulation of internet-scale code queries in context (T). In International Conference on Automated Software Engineering. IEEE, 24–35.

[89]

Lee Martie and Andre van der Hoek. 2015. Sameness: An experiment in code search. In Working Conference on Mining Software Repositories. IEEE, 76–87.

[90]

Michael McCandless, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Vol. 2. Manning Greenwich.

Digital Library

[91]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: Finding relevant functions and their usage. In International Conference on Software Engineering. 111–120.

Digital Library

[92]

Collin McMillan, Negar Hariri, Denys Poshyvanyk, Jane Cleland-Huang, and Bamshad Mobasher. 2012. Recommending source code for use in rapid software prototypes. In International Conference on Software Engineering. IEEE Press, 848–858.

[93]

Collin McMillan, Denys Poshyvanyk, Mark Grechanik, Qing Xie, and Chen Fu. 2013. Portfolio: Searching for relevant functions and their usages in millions of lines of code. ACM Transactions on Software Engineering and Methodology 22, 4 (2013), 1–30.

Digital Library

[94]

Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. 2013. A machine learning framework for programming by example. In International Conference on Machine Learning. PMLR, 187–195.

[95]

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. IEEE Computer Society 34, 8 (2010), 1388–1429.

[96]

Leann Myers and Maria J. Sirois. 2004. S. Pearman correlation coefficients, differences between. Encyclopedia of Statistical Sciences (2004).

[97]

Brent D. Nichols. 2010. Augmented bug localization using past bug information. In Annual Southeast Regional Conference. 1–6.

Digital Library

[98]

Liming Nie, He Jiang, Zhilei Ren, Zeyi Sun, and Xiaochen Li. 2016. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing 9, 5 (2016), 771–783.

[99]

Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.

Digital Library

[100]

Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for conducting systematic mapping studies in software engineering: An update. Information and Software Technology 64 (2015), 1–18.

Digital Library

[101]

Denys Poshyvanyk and Mark Grechanik. 2009. Creating and evolving software by searching, selecting and synthesizing relevant source code. In International Conference on Software Engineering-Companion. IEEE, 283–286.

[102]

Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. Swim: Synthesizing what I mean. Code search and idiomatic snippet synthesis. In International Conference on Software Engineering. IEEE, 357–367.

Digital Library

[103]

Chaiyong Ragkhitwetsagul and Jens Krinke. 2019. Siamese: Scalable and incremental code clone search via multiple code representations. Empirical Software Engineering 24, 4 (2019), 2236–2284.

Digital Library

[104]

Md. Masudur Rahman, Jed Barson, Sydney Paul, Joshua Kayani, Federico Andrés Lois, Sebastián Fernandez Quezada, Christopher Parnin, Kathryn T. Stolee, and Baishakhi Ray. 2018. Evaluating how developers use general-purpose web-search for code retrieval. In International Conference on Mining Software Repositories. ACM, 465–475.

Digital Library

[105]

Sukanya Ratanotayanon, Hye Jung Choi, and Susan Elliott Sim. 2010. My repository runneth over: An empirical study on diversifying data sources to improve feature search. In International Conference on Program Comprehension. IEEE, 206–215.

Digital Library

[106]

Steven P. Reiss. 2009. Semantics-based code search. In International Conference on Software Engineering. IEEE, 243–253.

Digital Library

[107]

Steven P. Reiss, Yun Miao, and Qi Xin. 2018. Seeking the user interface. Automated Software Engineering 25, 1 (2018), 157–193.

Digital Library

[108]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers, Inc.

[109]

Barbara Rosario. 2000. Latent semantic indexing: An overview. Techn. Rep. INFOSYS 240 (2000), 1–16.

[110]

Chanchal Kumar Roy and James R. Cordy. 2007. A survey on software clone detection research. Queen’s School of Computing TR 541, 115 (2007), 64–68.

[111]

Chanchal K. Roy, James R. Cordy, and Rainer Koschke. 2009. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74, 7 (2009), 470–495.

Digital Library

[112]

Julia Rubin and Marsha Chechik. 2013. A survey of feature location techniques. In Domain Engineering. Springer, 29–58.

[113]

Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: A neural code search. In International Workshop on Machine Learning and Programming Languages. 31–41.

Digital Library

[114]

Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum. 2015. How developers search for code: A case study. In Joint Meeting on Foundations of Software Engineering. ACM, 191–201.

Digital Library

[115]

Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC: Scaling code clone detection to big-code. In International Conference on Software Engineering. 1157–1168.

[116]

Abdullah Sheneamer and Jugal Kalita. 2016. A survey of software clone detection techniques. International Journal of Computer Applications 137, 10 (2016), 1–21.

[117]

Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, and Yan Lei. 2020. Improving code search with co-attentive representation learning. In International Conference on Program Comprehension.

Digital Library

[118]

Susan Elliott Sim, Medha Umarji, Sukanya Ratanotayanon, and Cristina V. Lopes. 2011. How well do search engines support code retrieval on the web?ACM Transactions on Software Engineering and Methodology 21, 1 (2011), 1–25.

[119]

Janice Singer, Timothy Lethbridge, Norman Vinson, and Nicolas Anquetil. 2010. An examination of software engineering work practices. In CASCON First Decade High Impact Papers. IBM Corp., 174–188.

[120]

Raphael Sirres, Tegawendé F. Bissyandé, Dongsun Kim, David Lo, Jacques Klein, Kisub Kim, and Yves Le Traon. 2018. Augmenting and structuring user queries to support efficient free-form code search. Empirical Software Engineering 23, 5 (2018), 2622–2654.

Digital Library

[121]

Bunyamin Sisman and Avinash C. Kak. 2013. Assisting code search with automatic query reformulation for bug localization. In Working Conference on Mining Software Repositories. IEEE, 309–318.

[122]

Jamie Starke, Chris Luce, and Jonathan Sillito. 2009. Searching and skimming: An exploratory study. In IEEE International Conference on Software Maintenance and Evolution. IEEE, 157–166.

[123]

Kathryn T. Stolee, Sebastian Elbaum, and Daniel Dobos. 2014. Solving the search for source code. ACM Transactions on Software Engineering and Methodology 23, 3 (2014), 26.

Digital Library

[124]

Kathryn T. Stolee, Sebastian Elbaum, and Matthew B. Dwyer. 2016. Code search with input/output queries: Generalizing, ranking, and assessment. Journal of Systems and Software 116 (2016), 35–48.

Digital Library

[125]

Rui Sun, Hui Liu, and Leping Li. 2019. Slicing based code recommendation for type-based instance retrieval. In International Conference on Software and Systems Reuse. Springer, 149–167.

[126]

Suresh Thummalapenta and Tao Xie. 2007. Parseweb: A programmer assistant for reusing open source code on the web. In International Conference on Automated Software Engineering. 204–213.

Digital Library

[127]

Suresh Thummalapenta and Tao Xie. 2009. Alattin: Mining alternative patterns for detecting neglected conditions. In International Conference on Automated Software Engineering. IEEE, 283–294.

Digital Library

[128]

Suresh Thummalapenta and Tao Xie. 2011. Alattin: Mining alternative patterns for defect detection. Automated Software Engineering 18, 3-4 (2011), 293–323.

Digital Library

[129]

Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI global, 242–264.

[130]

Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip Yu. 2019. Multi-modal attention network learning for semantic source code retrieval. In International Conference on Automated Software Engineering. IEEE, 13–25.

Digital Library

[131]

Bei Wang, Ling Xu, Meng Yan, Chao Liu, and Ling Liu. 2020. Multi-dimension convolutional neural network for bug localization. IEEE Transactions on Services Computing (2020).

[132]

Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, and Dongmei Zhang. 2013. Mining succinct and high-coverage API usage patterns from source code. In Working Conference on Mining Software Repositories. IEEE, 319–328.

[133]

Jianyong Wang and Jiawei Han. 2004. BIDE: Efficient mining of frequent closed sequences. In 20th International Conference on Data Engineering. IEEE, 79–90.

[134]

Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In International Conference on Automated Software Engineering. IEEE, 87–98.

Digital Library

[135]

Norman Wilde, Ross Huitt, and Scott Huitt. 1989. Dependency analysis tools: Reusable components for software maintenance. In Conference on Software Maintenance. IEEE, 126–131.

[136]

Huaiguang Wu and Yang Yang. 2019. Code search based on alteration intent. IEEE Access 7 (2019), 56796–56802.

[137]

Ho Chung Wu, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok. 2008. Interpreting TF-IDF term weights as making relevance decisions. ACM Transactions on Information Systems 26, 3 (2008), 1–37.

Digital Library

[138]

Xin Xia, Lingfeng Bao, David Lo, Pavneet Singh Kochhar, Ahmed E. Hassan, and Zhenchang Xing. 2017. What do developers search for on the web?Empirical Software Engineering 22, 6 (2017), 3149–3185.

[139]

Yingtao Xie, Tao Lin, and Hongyan Xu. 2019. User interface code retrieval: A novel visual-representation-aware approach. IEEE Access 7 (2019), 162756–162767.

[140]

Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In SIGSAC Conference on Computer and Communications Security. 363–376.

Digital Library

[141]

Yinxing Xue, Zhengzi Xu, Mahinthan Chandramohan, and Yang Liu. 2018. Accurate and scalable cross-architecture cross-os binary code search with emulation. IEEE Transactions on Software Engineering 45, 11 (2018), 1125–1149.

[142]

Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang. 2020. Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries. In International Conference on Software Analysis, Evolution and Reengineering. IEEE, 344–354.

[143]

Yangrui Yang and Qing Huang. 2017. IECS: Intent-enforced code search via extended Boolean model. Journal of Intelligent & Fuzzy Systems 33, 4 (2017), 2565–2576.

Digital Library

[144]

Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. CoaCor: Code annotation for code retrieval with reinforcement learning. In The World Wide Web Conference. 2203–2214.

Digital Library

[145]

Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, and Shikun Zhang. 2020. Leveraging code generation to improve code retrieval and summarization via dual learning. In The World Wide Web Conference. 2309–2319.

Digital Library

[146]

Feng Zhang, Haoran Niu, Iman Keivanloo, and Ying Zou. 2017. Expanding queries for code search using semantically related API class-names. IEEE Transactions on Software Engineering 44, 11 (2017), 1070–1082.

[147]

Jingxuan Zhang, He Jiang, Zhilei Ren, Tao Zhang, and Zhiqiu Huang. 2019. Enriching API documentation with code samples and usage scenarios from crowd knowledge. IEEE Transactions on Software Engineering (2019).

[148]

Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In International Conference on Software Engineering. IEEE, 783–794.

Digital Library

[149]

Jingtian Zhang, Sai Wu, Zeyuan Tan, Gang Chen, Zhushi Cheng, Wei Cao, Yusong Gao, and Xiaojie Feng. 2019. S3: A scalable in-memory skip-list index for key-value store. Proceedings of the VLDB Endowment 12, 12 (2019), 2183–2194.

Digital Library

[150]

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017).

[151]

Yu Zhang and Qiang Yang. 2018. An overview of multi-task learning. National Science Review 5, 1 (2018), 30–43.

[152]

Hao Zhong, Tao Xie, Lu Zhang, Jian Pei, and Hong Mei. 2009. MAPO: Mining and recommending API usage patterns. In European Conference on Object-Oriented Programming. Springer, 318–343.

[153]

Qun Zou and Changquan Zhang. 2020. Query expansion via learning change sequences. International Journal of Knowledge-Based and Intelligent Engineering Systems 24, 2 (2020), 95–105.

Cited By

Mondal A, Hossain M, Roy C, Roy B, Schneider K. (2025). FSECAM: A contextual thematic approach for linking feature to multi-level software architectural components. Journal of Systems and Software 219 (112245). Online publication date: Jan-2025.
https://doi.org/10.1016/j.jss.2024.112245

Hu H, Fang M, Liu J. (2025). An intent-enhanced feedback extension model for code search. Information and Software Technology 177 (107589). Online publication date: Jan-2025.
https://doi.org/10.1016/j.infsof.2024.107589

Zhang F, Li M, Wu H, Wu T. (2024). Intelligent code search aids edge software development. Journal of Cloud Computing: Advances, Systems and Applications 13:1. Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.1186/s13677-024-00629-5
Index Terms

Opportunities and Challenges in Code Search Tools
1. Software and its engineering
  1. Software creation and management
    1. Search-based software engineering

Published In

ACM Computing Surveys Volume 54, Issue 9

December 2022

800 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3485140

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2021 Association for Computing Machinery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2021

Accepted: 01 July 2021

Revised: 01 June 2021

Received: 01 November 2020

Published in CSUR Volume 54, Issue 9

Qualifiers

Survey
Refereed

Funding Sources

National Science Foundation of China
Key Research and Development Program of Zhejiang Province
National Research Foundation, Singapore
ARC Laureate Fellowship
