DOI: 10.5555/2831143.2831160

De-anonymizing programmers via code stylometry

Published: 12 August 2015

Abstract

Source code authorship attribution is a significant privacy threat to anonymous code contributors. However, it may also enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields. In this work, we investigate machine learning methods to de-anonymize the authors of C/C++ source code using coding style. Our Code Stylometry Feature Set is a novel representation of coding style that draws on properties derived from abstract syntax trees.
Our random forest and abstract syntax tree-based approach attributes more authors (1,600 and 250) with significantly higher accuracy (94% and 98%, respectively) on a larger data set (Google Code Jam) than has been previously achieved. Furthermore, these novel features are robust, difficult to obfuscate, and can be used in other programming languages, such as Python. We also find that (i) the code resulting from difficult programming tasks is easier to attribute than the code resulting from easier tasks and (ii) skilled programmers (who can complete the more difficult tasks) are easier to attribute than less skilled programmers.
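
As a rough illustration of features derived from abstract syntax trees, the sketch below uses Python's standard `ast` module rather than a C/C++ parser. It counts node types and parent-child node pairs, then assigns an unknown sample to the author of the most similar labelled sample by cosine similarity. The nearest-neighbour step is a simplified stand-in for the random forest named in the abstract. The function names (`ast_features`, `cosine`, `attribute`) and the feature choices are assumptions made for illustration, not the paper's Code Stylometry Feature Set.

```python
import ast
import math
from collections import Counter


def ast_features(source):
    """Relative frequencies of AST node types and parent>child type pairs."""
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        parent_name = type(parent).__name__
        counts["node:" + parent_name] += 1
        for child in ast.iter_child_nodes(parent):
            counts["edge:" + parent_name + ">" + type(child).__name__] += 1
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown_source, labelled_sources):
    """Return the author whose labelled sample is most similar to the unknown one.

    labelled_sources maps an author name to a list of source strings.
    This nearest-neighbour rule is an illustrative substitute for a random forest.
    """
    target = ast_features(unknown_source)
    best_author, best_score = None, -1.0
    for author, samples in labelled_sources.items():
        for sample in samples:
            score = cosine(target, ast_features(sample))
            if score > best_score:
                best_author, best_score = author, score
    return best_author


if __name__ == "__main__":
    # Hypothetical authors with two different habits for the same task.
    labelled = {
        "alice": ["def squares(n):\n    return [i * i for i in range(n)]\n"],
        "bob": [
            "def squares(n):\n"
            "    out = []\n"
            "    for i in range(n):\n"
            "        out.append(i * i)\n"
            "    return out\n"
        ],
    }
    unknown = "def products(m):\n    return [j * j for j in range(m)]\n"
    print(attribute(unknown, labelled))
```

Because the features depend only on tree structure, renaming identifiers leaves them unchanged, which hints at why syntactic features can survive simple renaming-based obfuscation. A real study would use many more samples per author and a properly validated classifier.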

Published In

SEC'15: Proceedings of the 24th USENIX Conference on Security Symposium
August 2015
1072 pages
ISBN: 9781931971232

Sponsors

  • USENIX Association

Publisher

USENIX Association

United States

Qualifiers

  • Article
