research-article
Open access

Flow2Vec: value-flow-based precise code embedding

Published: 13 November 2020

Abstract

Code embedding, as an emerging paradigm for source code analysis, has attracted much attention over the past few years. It aims to represent code semantics through distributed vector representations, which can be used to support a variety of program analysis tasks (e.g., code summarization and semantic labeling). However, existing code embedding approaches are intraprocedural and alias-unaware, and they ignore the asymmetric transitivity of directed graphs abstracted from source code; as a result, they are still ineffective in preserving the structural information of code.
This paper presents Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a. value-flows). By approximating the high-order proximity, i.e., the asymmetric transitivity of value-flows, Flow2Vec embeds control-flows and alias-aware data-flows of a program in a low-dimensional vector space. Our value-flow embedding is formulated as matrix multiplication to preserve context-sensitive transitivity through CFL-reachability by filtering out infeasible value-flow paths. We have evaluated Flow2Vec using 32 popular open-source projects. Results from our experiments show that Flow2Vec successfully boosts the performance of two recent code embedding approaches, code2vec and code2seq, for two client applications, i.e., code classification and code summarization. For code classification, Flow2Vec improves code2vec with an average increase of 21.2%, 20.1% and 20.7% in precision, recall and F1, respectively. For code summarization, Flow2Vec outperforms code2seq by an average of 13.2%, 18.8% and 16.0% in precision, recall and F1, respectively.
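The two ideas named in the abstract can be illustrated with a small sketch. It is not the Flow2Vec implementation: the three-node graph, the decay factor `BETA`, the truncated Katz-style sum, and the helper names `high_order_proximity` and `is_feasible` are all assumptions made for illustration. The first part shows how powers of a directed adjacency matrix yield a high-order proximity that is asymmetric (p reaches r, but r does not reach p). The second part shows a simplified matched-parenthesis check of the kind CFL-reachability uses to reject paths that enter a callee at one call site and return to another. How the paper combines the two inside its matrix formulation is not described in the abstract and is not reproduced here.

```python
# Illustrative sketch only; graph, BETA, and helper names are hypothetical.

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def high_order_proximity(adj, beta, max_len):
    """Truncated Katz-style sum: beta^k * A^k for k = 1..max_len."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]      # A^1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = matmul(power, adj)       # next power of A
        weight *= beta
    return total

def is_feasible(labels):
    """Check call/return matching along one path.

    Each label is ("call", site), ("ret", site), or None for an
    intraprocedural edge. A return must match the most recent unmatched
    call; a return with no pending call is accepted (unknown caller),
    which is a simplification.
    """
    stack = []
    for label in labels:
        if label is None:
            continue
        kind, site = label
        if kind == "call":
            stack.append(site)
        elif stack and stack.pop() != site:
            return False
    return True

# Directed value-flow chain p -> q -> r (indices 0, 1, 2).
ADJ = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
BETA = 0.5
S = high_order_proximity(ADJ, BETA, max_len=3)

print(S[0][2], S[2][0])  # proximity p->r versus r->p
print(is_feasible([("call", 1), None, ("ret", 1)]))  # matched call site
print(is_feasible([("call", 1), None, ("ret", 2)]))  # mismatched call site
```

In this toy graph the two-hop flow from p to r receives a smaller weight than the direct flow from p to q, and the reverse direction receives none, which is the asymmetry a symmetric similarity measure would lose.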

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p687-p-video.mp4)
This is a presentation video of my talk at OOPSLA 2020 on our paper accepted in the research track.


Information & Contributors

Information

Published In

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA
November 2020
3108 pages
EISSN: 2475-1421
DOI: 10.1145/3436718
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Distinguished Paper

Author Tags

  1. Flow2Vec
  2. asymmetric transitivity
  3. code embedding
  4. value-flows

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 830
  • Downloads (Last 6 weeks): 70
Reflects downloads up to 30 Aug 2024

Citations

Cited By

  • (2024) AI-Assisted Programming Tasks Using Code Embeddings and Transformers. Electronics 13:4 (767). DOI: 10.3390/electronics13040767. Online publication date: 15-Feb-2024
  • (2024) Dynamic Transitive Closure-based Static Analysis through the Lens of Quantum Search. ACM Transactions on Software Engineering and Methodology 33:5 (1-29). DOI: 10.1145/3644389. Online publication date: 4-Jun-2024
  • (2024) Fast Graph Simplification for Path-Sensitive Typestate Analysis through Tempo-Spatial Multi-Point Slicing. Proceedings of the ACM on Software Engineering 1:FSE (494-516). DOI: 10.1145/3643749. Online publication date: 12-Jul-2024
  • (2024) Precise Sparse Abstract Execution via Cross-Domain Interaction. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639220. Online publication date: 20-May-2024
  • (2024) Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization. IEEE Transactions on Software Engineering 50:8 (2077-2095). DOI: 10.1109/TSE.2024.3422274. Online publication date: Aug-2024
  • (2024) Stealthy Backdoor Attack for Code Models. IEEE Transactions on Software Engineering 50:4 (721-741). DOI: 10.1109/TSE.2024.3361661. Online publication date: Apr-2024
  • (2024) INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers. IEEE Transactions on Software Engineering 50:2 (220-238). DOI: 10.1109/TSE.2023.3341624. Online publication date: Feb-2024
  • (2024) Ponzi Scheme Detection in Smart Contract via Transaction Semantic Representation Learning. IEEE Transactions on Reliability 73:2 (1117-1131). DOI: 10.1109/TR.2023.3319318. Online publication date: Jun-2024
  • (2024) How About Bug-Triggering Paths? - Understanding and Characterizing Learning-Based Vulnerability Detectors. IEEE Transactions on Dependable and Secure Computing 21:2 (542-558). DOI: 10.1109/TDSC.2022.3192419. Online publication date: Mar-2024
  • (2024) GraphBinMatch: Graph-Based Similarity Learning for Cross-Language Binary and Source Code Matching. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (506-515). DOI: 10.1109/IPDPSW63119.2024.00103. Online publication date: 27-May-2024
