research-article
Open access

DéjàVu: a map of code duplicates on GitHub

Published: 12 October 2017

Abstract

Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication: only 6% of its files are distinct. Java, on the other hand, has the least duplication: 60% of its files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DéjàVu, a publicly available map of code duplicates in GitHub repositories.
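
To make the two measurements in the abstract concrete, here is a minimal sketch of how a distinct-file rate and a per-project "found elsewhere" share could be computed. It is not the paper's pipeline: the abstract does not say how duplicates are detected, so the sketch assumes that two files are duplicates when their bytes are identical (compared via SHA-256). The function names and the toy `corpus` are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Fingerprint a file by its exact bytes (assumption: exact-copy duplicates only)."""
    return hashlib.sha256(data).hexdigest()


def distinct_file_rate(corpus):
    """Fraction of files in the corpus whose content is distinct.

    corpus maps project name -> {path: file bytes}.
    """
    hashes = [content_hash(data)
              for files in corpus.values()
              for data in files.values()]
    return len(set(hashes)) / len(hashes) if hashes else 0.0


def share_found_elsewhere(corpus, project):
    """Fraction of a project's files that also appear in some other project."""
    # Record which projects contain each distinct content hash.
    owners = defaultdict(set)
    for name, files in corpus.items():
        for data in files.values():
            owners[content_hash(data)].add(name)

    files = corpus[project]
    if not files:
        return 0.0
    shared = sum(1 for data in files.values()
                 if owners[content_hash(data)] - {project})
    return shared / len(files)


# Hypothetical toy corpus: three projects, five files, three distinct contents.
corpus = {
    "a": {"x.js": b"lib", "y.js": b"app-a"},
    "b": {"x.js": b"lib", "z.js": b"lib"},
    "c": {"w.js": b"app-c"},
}

print(distinct_file_rate(corpus))          # 3 distinct contents out of 5 files
print(share_found_elsewhere(corpus, "b"))  # every file in "b" also exists in "a"
```

Under this assumption, a project would count toward the abstract's project-level figure when `share_found_elsewhere` returns at least 0.8. Detecting near-duplicates (files that differ only in whitespace or comments, for example) would need a different comparison than an exact hash.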

Supplementary Material

Auxiliary Archive (oopsla17-oopsla176-aux.zip)


Published In

Proceedings of the ACM on Programming Languages, Volume 1, Issue OOPSLA
October 2017
1786 pages
EISSN: 2475-1421
DOI: 10.1145/3152284
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Clone Detection
  2. Source Code Analysis

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 611
  • Downloads (Last 6 weeks): 65
Reflects downloads up to 22 Sep 2024

Cited By

  • (2024) Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-task Learning. Proceedings of the ACM on Software Engineering 1:FSE (1679-1702). DOI: 10.1145/3660782. Online publication date: 12-Jul-2024
  • (2024) Dataset: Copy-based Reuse in Open Source Software. Proceedings of the 21st International Conference on Mining Software Repositories (42-47). DOI: 10.1145/3643991.3644868. Online publication date: 15-Apr-2024
  • (2024) AntiCopyPaster 2.0: Whitebox just-in-time code duplicates extraction. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (84-88). DOI: 10.1145/3639478.3640035. Online publication date: 14-Apr-2024
  • (2024) SourcererJBF: A Java Build Framework For Large-Scale Compilation. ACM Transactions on Software Engineering and Methodology 33:3 (1-35). DOI: 10.1145/3635710. Online publication date: 15-Mar-2024
  • (2024) Data-Driven Evidence-Based Syntactic Sugar Design. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639580. Online publication date: 20-May-2024
  • (2024) CNEPS: A Precise Approach for Examining Dependencies among Third-Party C/C++ Open-Source Components. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639209. Online publication date: 20-May-2024
  • (2024) Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization). Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639183. Online publication date: 20-May-2024
  • (2024) BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639100. Online publication date: 20-May-2024
  • (2024) Automatically Identifying CVE Affected Versions With Patches and Developer Logs. IEEE Transactions on Dependable and Secure Computing 21:2 (905-919). DOI: 10.1109/TDSC.2023.3264567. Online publication date: Mar-2024
  • (2024) Code search engines for the next generation. Journal of Systems and Software 215 (112065). DOI: 10.1016/j.jss.2024.112065. Online publication date: Sep-2024
