Research article • Open access

On the Impact of Programming Languages on Code Quality: A Reproduction Study

Published: 12 October 2019

Abstract

In a 2014 article, Ray, Posnett, Devanbu, and Filkov claimed to have uncovered a statistically significant association between 11 programming languages and software defects in 729 projects hosted on GitHub. Specifically, their work answered four research questions relating to software defects and programming languages. With data and code provided by the authors, the present article first attempts to conduct an experimental repetition of the original study. The repetition is only partially successful, due to missing code and issues with the classification of languages. The second part of this work focuses on their main claim, the association between bugs and languages, and performs a complete, independent reanalysis of the data and of the statistical modeling steps undertaken by Ray et al. in 2014. This reanalysis uncovers a number of serious flaws that reduce the number of languages with an association with defects down from 11 to only 4. Moreover, the practical effect size is exceedingly small. These results thus undermine the conclusions of the original study. Correcting the record is important, as many subsequent works have cited the 2014 article and have asserted, without evidence, a causal link between the choice of programming language for a given task and the number of software defects. Causation is not supported by the data at hand; and, in our opinion, even after fixing the methodological flaws we uncovered, too many unaccounted sources of bias remain to hope for a meaningful comparison of bug rates across languages.
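
The reanalysis hinges in part on careful multiple hypothesis testing: when one coefficient is tested per language, naive per-test p < 0.05 thresholds overstate how many associations are real. The Benjamini-Hochberg procedure [1] controls the false discovery rate across such a family of tests. Below is a minimal sketch in Python (not the study's R code), with p-values invented for illustration rather than taken from the study:

    def benjamini_hochberg(p_values, alpha=0.05):
        """Benjamini-Hochberg step-up procedure [1]: mark which of m
        hypotheses to reject while controlling the false discovery
        rate at level alpha."""
        m = len(p_values)
        # Walk the p-values in ascending order, remembering positions.
        order = sorted(range(m), key=lambda i: p_values[i])
        # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * alpha.
        k_max = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= rank / m * alpha:
                k_max = rank
        # Reject exactly the hypotheses with the k_max smallest p-values.
        rejected = [False] * m
        for rank, i in enumerate(order, start=1):
            rejected[i] = rank <= k_max
        return rejected

    # Eleven hypothetical per-language p-values (illustrative only).
    p = [0.001, 0.004, 0.019, 0.03, 0.04, 0.05, 0.20, 0.25, 0.40, 0.60, 0.90]
    print(sum(benjamini_hochberg(p)))  # prints 2; a naive p < 0.05 cutoff would flag 5

With these illustrative numbers, the corrected procedure retains two associations where an uncorrected threshold would retain five; shrinkage of this kind, compounded with the other fixes described in the paper, is what reduces the original study's 11 significant languages to 4.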

Supplementary Material

a21-vitek (a21-vitek.webm)
Presentation at SIGPLAN SPLASH '19

References

[1]
Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 1 (1995).
[2]
Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, and Premkumar Devanbu. 2009. Fair and balanced?: Bias in bug-fix datasets. In Proceedings of the Symposium on the Foundations of Software Engineering (ESEC/FSE’09).
[3]
Casey Casalnuovo, Yagnik Suchak, Baishakhi Ray, and Cindy Rubio-González. 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA’17).
[4]
David Colquhoun. 2017. The reproducibility of research and the misinterpretation of p-values. R. Soc. Open Sci. 4, 171085 (2017).
[5]
Premkumar T. Devanbu. 2018. Research Statement. Retrieved from www.cs.ucdavis.edu/~devanbu/research.pdf.
[6]
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien Nguyen. 2013. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the International Conference on Software Engineering (ICSE’13).
[7]
J. J. Faraway. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.
[8]
Dror G. Feitelson. 2015. From repeatability to reproducibility and corroboration. SIGOPS Oper. Syst. Rev. 49, 1 (Jan. 2015).
[9]
Omar S. Gómez, Natalia Juristo Juzgado, and Sira Vegas. 2010. Replications types in experimental disciplines. In Proceedings of the Symposium on Empirical Software Engineering and Measurement (ESEM’10).
[10]
Garrett Grolemund and Hadley Wickham. 2017. R for Data Science. O’Reilly.
[11]
Lewis G. Halsey, Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. 2015. The fickle p-value generates irreproducible results. Nat. Methods 12 (2015).
[12]
Kim Herzig, Sascha Just, and Andreas Zeller. 2013. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In Proceedings of the International Conference on Software Engineering (ICSE’13).
[13]
John Ioannidis. 2005. Why most published research findings are false. PLoS Med 2, 8 (2005).
[14]
George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the Conference on Computer and Communications Security (CCS’18).
[15]
Paul Krill. 2014. Functional languages rack up best scores for software quality. InfoWorld (Nov. 2014). https://www.infoworld.com/article/2844268/functional-languages-rack-up-best-scores-software-quality.html.
[16]
Shriram Krishnamurthi and Jan Vitek. 2015. The real software crisis: Repeatability as a core value. Commun. ACM 58, 3 (2015).
[17]
Michael H. Kutner, John Neter, Christopher J. Nachtsheim, and William Li. 2004. Applied Linear Statistical Models. McGraw–Hill Education, New York, NY. https://books.google.cz/books?id=XAzYCwAAQBAJ
[18]
Crista Lopes, Petr Maj, Pedro Martins, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. Déjà Vu: A map of code duplicates on GitHub. In Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’17).
[19]
Audris Mockus and Lawrence Votta. 2000. Identifying reasons for software changes using historic databases. In Proceedings of the International Conference on Software Maintenance (ICSM’00).
[20]
Martin Monperrus. 2014. A critical review of “automatic patch generation learned from human-written patches”: Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the International Conference on Software Engineering (ICSE’14).
[21]
Sebastian Nanz and Carlo A. Furia. 2015. A comparative study of programming languages in Rosetta Code. In Proceedings of the International Conference on Software Engineering (ICSE’15). http://dl.acm.org/citation.cfm?id=2818754.2818848.
[22]
Roger Peng. 2011. Reproducible research in computational science. Science 334, 1226 (2011).
[23]
Dong Qiu, Bixin Li, Earl T. Barr, and Zhendong Su. 2017. Understanding the syntactic rule usage in Java. J. Syst. Softw. 123 (Jan. 2017), 160--172.
[24]
B. Ray and D. Posnett. 2016. A large ecosystem study to understand the effect of programming languages on code quality. In Perspectives on Data Science for Software Engineering. Morgan Kaufmann.
[25]
Baishakhi Ray, Daryl Posnett, Premkumar T. Devanbu, and Vladimir Filkov. 2017. A large-scale study of programming languages and code quality in GitHub. Commun. ACM 60, 10 (2017).
[26]
Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar T. Devanbu. 2014. A large scale study of programming languages and code quality in GitHub. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’14).
[27]
Rolando P. Reyes, Oscar Dieste, Efraín R. Fonseca, and Natalia Juristo. 2018. Statistical errors in software engineering experiments: A preliminary literature review. In Proceedings of the International Conference on Software Engineering (ICSE’18).
[28]
Yuan Tian, Julia Lawall, and David Lo. 2012. Identifying Linux bug fixing patches. In Proceedings of the International Conference on Software Engineering (ICSE’12).
[29]
Jan Vitek and Tomas Kalibera. 2011. Repeatability, reproducibility, and rigor in systems research. In Proceedings of the International Conference on Embedded Software (EMSOFT’11). 33--38.
[30]
Ronald L. Wasserstein and Nicole A. Lazar. 2016. The ASA’s statement on p-values: Context, process, and purpose. Am. Stat. 70, 2 (2016).
[31]
Jie Zhang, Feng Li, Dan Hao, Meng Wang, and Lu Zhang. 2018. How does bug-handling effort differ among different programming languages? CoRR abs/1801.01025 (2018). http://arxiv.org/abs/1801.01025.



Published In

ACM Transactions on Programming Languages and Systems, Volume 41, Issue 4
December 2019
186 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/3366632
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2019
Accepted: 01 June 2019
Revised: 01 May 2019
Received: 01 December 2018
Published in TOPLAS Volume 41, Issue 4


Author Tag

  1. Programming Languages on Code Quality

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Czech Ministry of Education, Youth and Sports
  • NSF
  • European Research Council under the European Union's Horizon 2020 research and innovation programme


Article Metrics

  • Downloads (last 12 months): 1,560
  • Downloads (last 6 weeks): 84
Reflects downloads up to 01 Sep 2024


Cited By

  • (2024) What Makes a Good TODO Comment? ACM Transactions on Software Engineering and Methodology 33:6, 1-30. https://doi.org/10.1145/3664811. Online publication date: 28-Jun-2024.
  • (2024) Learning to Detect and Localize Multilingual Bugs. Proceedings of the ACM on Software Engineering 1:FSE, 2190-2213. https://doi.org/10.1145/3660804. Online publication date: 12-Jul-2024.
  • (2024) Risky Dynamic Typing-related Practices in Python: An Empirical Study. ACM Transactions on Software Engineering and Methodology 33:6, 1-35. https://doi.org/10.1145/3649593. Online publication date: 27-Jun-2024.
  • (2024) How Are Multilingual Systems Constructed: Characterizing Language Use and Selection in Open-Source Multilingual Software. ACM Transactions on Software Engineering and Methodology 33:3, 1-46. https://doi.org/10.1145/3631967. Online publication date: 14-Mar-2024.
  • (2024) The Classics Never Go Out of Style: An Empirical Study of Downgrades from the Bazel Build Technology. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. https://doi.org/10.1145/3597503.3639169. Online publication date: 20-May-2024.
  • (2024) An exploratory study on just-in-time multi-programming-language bug prediction. Information and Software Technology 175, 107524. https://doi.org/10.1016/j.infsof.2024.107524. Online publication date: Nov-2024.
  • (2024) What can we learn from quality assurance badges in open-source software? Science China Information Sciences 67:4. https://doi.org/10.1007/s11432-022-3611-3. Online publication date: 26-Mar-2024.
  • (2024) Indentation and reading time: a randomized control trial on the differences between generated indented and non-indented if-statements. Empirical Software Engineering 29:5. https://doi.org/10.1007/s10664-024-10531-y. Online publication date: 9-Aug-2024.
  • (2023) Análise comparativa entre linguagens de programação em sistemas embarcados móveis Android [Comparative analysis of programming languages in Android mobile embedded systems]. Proceedings of the XXVII Brazilian Symposium on Programming Languages, 56-63. https://doi.org/10.1145/3624309.3624319. Online publication date: 25-Sep-2023.
  • (2023) Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions. ACM Transactions on Software Engineering and Methodology 33:1, 1-35. https://doi.org/10.1145/3611667. Online publication date: 19-Aug-2023.
