research-article

Semantic-Enriched Code Knowledge Graph to Reveal Unknowns in Smart Contract Code Reuse

Authors:

Zhenchang Xing,

Changjing Wang,

Xin Xia

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 6

Article No.: 147, Pages 1 - 37

https://doi.org/10.1145/3597206

Published: 30 September 2023

Abstract

Programmers developing smart contracts often struggle to reuse code from repositories because of two unknowns that can lead to non-functional and functional failures: implicit collaborations between functions and subtle differences among similar functions. Current code mining methods can extract syntactic and semantic knowledge (the known knowledge), but they cannot uncover these unknowns because of the significant gap between the known and the unknown. To address this issue, we formulate knowledge acquisition as a knowledge deduction task and propose an analytic flow that uses the function clone as a bridge to gradually deduce, from the known knowledge, the problem-solving knowledge that can reveal the unknowns. This flow comprises five methods: clone detection, co-occurrence probability calculation, function usage frequency accumulation, description propagation, and control flow graph annotation. Together, they provide a systematic and coherent approach to knowledge deduction. We then structure all of the knowledge into a semantic-enriched code Knowledge Graph (KG) and integrate this KG into two software engineering tasks: code recommendation and crowd-scaled coding practice checking. As a proof of concept, we apply our approach to 5,140 smart contract files available on Etherscan.io and confirm the high accuracy of our KG construction steps. In our experiments, our code KG improved code recommendation accuracy by 6% to 45%, increased diversity by 61% to 102%, and enhanced NDCG by 1% to 21%. Furthermore, compared to traditional analysis tools and the debugging-with-the-crowd method, our KG improved time efficiency by 30 to 380 seconds, vulnerability determination accuracy by 20% to 33%, and vulnerability fixing accuracy by 24% to 40% for novice developers who identified and fixed vulnerable smart contract functions.
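
The abstract reports NDCG gains for code recommendation without restating the metric. As a reading aid, the following is a minimal sketch of the standard normalized discounted cumulative gain with linear gain and a log2 rank discount. The function names, the optional cutoff k, and the example relevance grades are assumptions made for illustration; the exact variant used in the article is not stated in this excerpt.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each grade is divided by log2(rank + 1),
    where rank is the 1-based position in the result list."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG of a ranked list of relevance grades, optionally cut off at k.

    The ideal ordering is the same grades sorted from most to least relevant.
    Returns 0.0 when no item in the list is relevant.
    """
    ranked = list(relevances)
    ideal = sorted(ranked, reverse=True)
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades for five recommended functions (3 = most relevant).
    print(round(ndcg([3, 0, 2, 1, 0]), 4))
```

A score of 1.0 means the recommended functions are already in the best possible order; placing relevant functions lower in the list lowers the score.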

Cited By

Zhang X, Chen Z, Cao Y, Chen L, Zhou Y. (2024). Multi-Intent Inline Code Comment Generation via Large Language Model. International Journal of Software Engineering and Knowledge Engineering 34:06 (845-868). Online publication date: 23-Mar-2024.
https://doi.org/10.1142/S0218194024500050
Wu D, Zhang H, Feng Y, Dong Z, Sun Y. (2024). The future of API analytics. Automated Software Engineering 31:2. Online publication date: 9-Jun-2024.
https://dl.acm.org/doi/10.1007/s10515-024-00442-z

Index Terms

Semantic-Enriched Code Knowledge Graph to Reveal Unknowns in Smart Contract Code Reuse
1. Security and privacy
  1. Software and application security
    1. Software security engineering
2. Software and its engineering
  1. Software creation and management
    1. Search-based software engineering
    2. Software development techniques
      1. Reusability
  2. Software notations and tools
    1. Formal language definitions
      1. Semantics

Published In

ACM Transactions on Software Engineering and Methodology Volume 32, Issue 6

November 2023

949 pages

ISSN: 1049-331X

EISSN: 1557-7392

DOI: 10.1145/3625557

Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2023

Online AM: 22 May 2023

Accepted: 27 April 2023

Revised: 19 April 2023

Received: 30 July 2022

Published in TOSEM Volume 32, Issue 6

Qualifiers

Research-article

Funding Sources

National Nature Science Foundation of China
Science and Technology Key Project of Education Department of Jiangxi Province
Graduate Innovative Special Fund Projects of Jiangxi Province
