DOI: 10.1145/3270101.3270110
research-article

Integration of Static and Dynamic Code Stylometry Analysis for Programmer De-anonymization

Published: 15 January 2018

Abstract

De-anonymizing the authors of anonymous code (i.e., code stylometry) has significant privacy and security implications. Most existing code stylometry methods rely solely on static (e.g., lexical, layout, and syntactic) features extracted from source code and neglect the key difference between code and regular text: code is executable. In this paper, we present Sundae, a novel code de-anonymization framework that integrates both static and dynamic stylometry analysis. Sundae departs from existing solutions in three significant ways: (i) it requires far fewer static, hand-crafted features; (ii) it requires much less labeled data for training; and (iii) it can be readily extended to new programmers once their stylometry information becomes available. Through extensive evaluation on benchmark datasets, we demonstrate that Sundae delivers strong empirical performance. For example, in a setting with 229 programmers and 9 problems, it outperforms the state-of-the-art method by a margin of 45.65% on Python code de-anonymization. The empirical results highlight the integration of static and dynamic analysis as a promising direction for code stylometry research.
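
To make the distinction between static and dynamic features concrete, the following is a minimal Python sketch. It is not Sundae's actual feature extractor. It assumes that counts of AST node types can stand in for static syntactic features and that per-function call counts recorded by cProfile can stand in for dynamic features. The names static_features, dynamic_features, and SAMPLE are hypothetical and chosen only for illustration.

```python
import ast
import cProfile
import pstats
from collections import Counter

# Hypothetical sample "submission" used only for illustration.
SAMPLE = """
def square(x):
    return x * x

total = 0
for i in range(3):
    total += square(i)
"""


def static_features(source: str) -> Counter:
    """Count AST node types without running the code (static view)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def dynamic_features(source: str) -> dict:
    """Execute the code under cProfile and count calls per function (dynamic view).

    Assumption: the source is trusted. Real submissions should be run in a sandbox.
    """
    code = compile(source, "<submission>", "exec")
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        exec(code, {"__name__": "__main__"})
    finally:
        profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, line, name) -> (primitive calls, total calls, ...).
    # Keep only functions defined in the submission itself.
    return {
        name: ncalls
        for (filename, _line, name), (_cc, ncalls, _tt, _ct, _callers)
        in stats.stats.items()
        if filename == "<submission>"
    }


if __name__ == "__main__":
    print(static_features(SAMPLE).most_common(5))
    print(dynamic_features(SAMPLE))
```

The static view sees only the program's structure, while the dynamic view depends on what the program does when it runs, such as how often each function is called. A combined approach would feed both kinds of features to a classifier.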

    Published In

    AISec '18: Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security
    October 2018
    103 pages
    ISBN:9781450360043
    DOI:10.1145/3270101

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. code stylometry
    2. de-anonymization
    3. dynamic analysis

    Qualifiers

    • Research-article

    Conference

    CCS '18

    Acceptance Rates

    Overall Acceptance Rate 22 of 53 submissions, 42%

    Cited By

    • (2024) Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey. Information 15:3 (131). DOI: 10.3390/info15030131. Online publication date: 28-Feb-2024
    • (2021) Towards Improving Code Stylometry Analysis in Underground Forums. Proceedings on Privacy Enhancing Technologies 2022:1 (126-147). DOI: 10.2478/popets-2022-0007. Online publication date: 20-Nov-2021
    • (2021) A Method of Source Code Authorship Attribution Based on Graph Neural Network. Proceedings of 2021 Chinese Intelligent Automation Conference (645-657). DOI: 10.1007/978-981-16-6372-7_70. Online publication date: 8-Oct-2021
    • (2020) Source Code Authorship Identification Using Deep Neural Networks. Symmetry 12:12 (2044). DOI: 10.3390/sym12122044. Online publication date: 10-Dec-2020
    • (2020) Zero-Shot Source Code Author Identification: A Lexicon and Layout Independent Approach. 2020 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN48605.2020.9207647. Online publication date: Jul-2020
    • (2019) De-Anonymization of the Author of the Source Code Using Machine Learning Algorithms. 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON) (0612-0617). DOI: 10.1109/SIBIRCON48586.2019.8958026. Online publication date: Oct-2019
