
Semantic-Enriched Code Knowledge Graph to Reveal Unknowns in Smart Contract Code Reuse

Published: 30 September 2023

Abstract

Programmers who work with smart contract development often encounter challenges in reusing code from repositories. This is due to the presence of two unknowns that can lead to non-functional and functional failures. These unknowns are implicit collaborations between functions and subtle differences among similar functions. Current code mining methods can extract syntax and semantic knowledge (known knowledge), but they cannot uncover these unknowns due to a significant gap between the known and the unknown. To address this issue, we formulate knowledge acquisition as a knowledge deduction task and propose an analytic flow that uses the function clone as a bridge to gradually deduce the known knowledge into the problem-solving knowledge that can reveal the unknowns. This flow comprises five methods: clone detection, co-occurrence probability calculation, function usage frequency accumulation, description propagation, and control flow graph annotation. This provides a systematic and coherent approach to knowledge deduction. We then structure all of the knowledge into a semantic-enriched code Knowledge Graph (KG) and integrate this KG into two software engineering tasks: code recommendation and crowd-scaled coding practice checking. As a proof of concept, we apply our approach to 5,140 smart contract files available on Etherscan.io and confirm high accuracy of our KG construction steps. In our experiments, our code KG effectively improved code recommendation accuracy by 6% to 45%, increased diversity by 61% to 102%, and enhanced NDCG by 1% to 21%. Furthermore, compared to traditional analysis tools and the debugging-with-the-crowd method, our KG improved time efficiency by 30 to 380 seconds, vulnerability determination accuracy by 20% to 33%, and vulnerability fixing accuracy by 24% to 40% for novice developers who identified and fixed vulnerable smart contract functions.
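Two of the five steps named in the abstract, co-occurrence probability calculation and function usage frequency accumulation, can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so this is an assumption for illustration only: usage frequency is taken to be the number of contracts in which a function clone class appears, and the co-occurrence probability of g given f is the fraction of contracts containing f that also contain g. The toy corpus and the names `usage_frequency` and `co_occurrence_probability` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def usage_frequency(contracts):
    """Count in how many contracts each function (clone class) appears."""
    freq = Counter()
    for functions in contracts:
        # Use a set so a function repeated inside one contract counts once.
        freq.update(set(functions))
    return freq


def co_occurrence_probability(contracts):
    """Estimate P(g | f): share of contracts containing f that also contain g."""
    freq = usage_frequency(contracts)
    pair_counts = Counter()
    for functions in contracts:
        # Ordered pairs (f, g) with f != g that appear in the same contract.
        pair_counts.update(permutations(sorted(set(functions)), 2))
    return {(f, g): n / freq[f] for (f, g), n in pair_counts.items()}


# Hypothetical toy corpus: each contract is a list of clone-class labels,
# i.e. functions already grouped by a clone detector (assumed, not shown).
contracts = [
    ["transfer", "approve", "transferFrom"],
    ["transfer", "approve"],
    ["transfer", "mint"],
]

freq = usage_frequency(contracts)
prob = co_occurrence_probability(contracts)
print(freq["transfer"])
print(prob[("approve", "transfer")])
```

Under these assumptions, a high value of P(g | f) would hint that f is rarely reused without g (an implicit collaboration), while the usage frequencies could help rank similar functions by how widely the crowd has adopted them.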


Cited By

  • (2024) Multi-Intent Inline Code Comment Generation via Large Language Model. International Journal of Software Engineering and Knowledge Engineering 34:06 (845-868). DOI: 10.1142/S0218194024500050. Online publication date: 23-Mar-2024
  • (2024) The future of API analytics. Automated Software Engineering 31:2. DOI: 10.1007/s10515-024-00442-z. Online publication date: 9-Jun-2024

Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 6
November 2023
949 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3625557
  • Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2023
Online AM: 22 May 2023
Accepted: 27 April 2023
Revised: 19 April 2023
Received: 30 July 2022
Published in TOSEM Volume 32, Issue 6


Author Tags

  1. Smart contract
  2. code knowledge graph
  3. knowledge deduction
  4. code recommendation
  5. crowd-scale coding practice checking

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Science and Technology Key Project of Education Department of Jiangxi Province
  • Graduate Innovative Special Fund Projects of Jiangxi Province

Article Metrics

  • Downloads (Last 12 months): 406
  • Downloads (Last 6 weeks): 29
Reflects downloads up to 02 Sep 2024.
