Abstract
Background
Research software plays an important role in solving real-life problems, empowering scientific innovations, and handling emergency situations. Therefore, the correctness and trustworthiness of research software are of paramount importance. Software testing is an important activity for identifying problematic code and helping to produce high-quality software. However, testing research software is difficult due to the complexity of the underlying science, the frequent absence of known correct results for scientific algorithms, and the culture of the research software community.
Aims
The goal of this paper is to better understand current testing practices, identify challenges, and provide recommendations on how to improve the testing process for research software development.
Method
We surveyed members of the research software developer community to collect information regarding their knowledge about and use of software testing in their projects.
Results
We analysed 120 responses and found that even though research software developers report an average level of knowledge about software testing, they still find testing difficult due to the numerous challenges involved. However, there are a number of remedies, such as proper training, that can improve the testing process for research software.
Conclusions
Testing can be challenging for any type of software. This difficulty is especially present in the development of research software, where software engineering activities are typically given less attention. To produce trustworthy results from research software, there is a need for a culture change so that testing is valued and teams devote appropriate effort to writing and executing tests.
Notes
The survey data is available in a public repository but set to private until publication of this paper (Carver and Eisty 2021).
References
Ackroyd K S, Kinder S H, Mant G R, Miller M C, Ramsdale C A, Stephenson P C (2008) Scientific software development at a research facility. IEEE Softw 25(4):44–51. https://doi.org/10.1109/MS.2008.93
Ammann P, Offutt J (2016) Introduction to software testing, 2nd edn. Cambridge University Press, Cambridge
Bourque P, Fairley RE (eds) (2014) SWEBOK: guide to the software engineering body of knowledge, version 3.0. IEEE Computer Society, Los Alamitos. http://www.swebok.org/
Carver JC, Eisty N (2021) Testing research software: survey data. https://doi.org/10.6084/m9.figshare.16663561
Clune T, Rood R (2011) Software testing and verification in climate model development. IEEE Softw 28(6):49–55. https://doi.org/10.1109/MS.2011.117
Drake J B, Jones P W, Carr Jr G R (2005) Overview of the software design of the community climate system model. Int J High Perform Comput Appl 19(3):177–186. https://doi.org/10.1177/1094342005056094
Easterbrook S M (2010) Climate change: a grand software challenge. In: Proceedings of the FSE/SDP workshop on future of software engineering research, FoSER ’10. https://doi.org/10.1145/1882362.1882383. ACM, pp 99–104
Easterbrook S M, Johns T C (2009) Engineering the software for understanding climate change. Comput Sci Eng 11(6):65–74. https://doi.org/10.1109/MCSE.2009.193
Eddins S L (2009) Automated software testing for MATLAB. Comput Sci Eng 11(6):48–55. https://doi.org/10.1109/MCSE.2009.186
Eisty N U, Thiruvathukal G K, Carver J C (2018) A survey of software metric use in research software development. In: 2018 IEEE 14th international conference on e-science (e-science). https://doi.org/10.1109/eScience.2018.00036, pp 212–222
Farrell P E, Piggott M D, Gorman G J, Ham D A, Wilson C R, Bond T M (2011) Automated continuous verification for numerical simulation. Geosci Model Dev 4(2):435–449. https://doi.org/10.5194/gmd-4-435-2011
Hannay J E, MacLeod C, Singer J, Langtangen H P, Pfahl D, Wilson G (2009) How do scientists develop and use scientific software?. In: 2009 ICSE workshop on software engineering for computational science and engineering. https://doi.org/10.1109/SECSE.2009.5069155, pp 1–8
Heaton D, Carver J C (2015) Claims about the use of software engineering practices in science: a systematic literature review. Inf Softw Technol 67:207–219. https://doi.org/10.1016/j.infsof.2015.07.011
Hill C (2016) Socio-economic status and computer use: designing software that supports low-income users. In: 2016 IEEE Symposium on visual languages and human-centric computing. https://doi.org/10.1109/VLHCC.2016.7739651, pp 1–1
Hochstein L, Basili V R (2008) The ASC-Alliance projects: a case study of large-scale parallel scientific code development. Computer 41(3):50–58. https://doi.org/10.1109/MC.2008.101
Hook D, Kelly D (2009) Testing for trustworthiness in scientific software. In: 2009 ICSE workshop on software engineering for computational science and engineering. https://doi.org/10.1109/SECSE.2009.5069163, pp 59–64
Kanewala U, Bieman J M (2014) Testing scientific software: a systematic literature review. Inf Softw Technol 56(10):1219–1232. https://doi.org/10.1016/j.infsof.2014.05.006
Katz D S, McInnes L C, Bernholdt D E, Mayes A C, Hong N P C, Duckles J, Gesing S, Heroux M A, Hettrick S, Jimenez R C, Pierce M, Weaver B, Wilkins-Diehr N (2019) Community organizations: changing the culture in which research software is developed and sustained. Comput Sci Eng 21(2):8–24. https://doi.org/10.1109/MCSE.2018.2883051
Kelly D, Sanders R (2008) The challenge of testing scientific software. In: Proceedings of the conference for the association for software testing, pp 30–36
Kelly D, Hook D, Sanders R (2009) Five recommended practices for computational scientists who write software. Comput Sci Eng 11(5):48–53. https://doi.org/10.1109/MCSE.2009.139
Kelly D, Thorsteinson S, Hook D (2011) Scientific software testing: analysis with four dimensions. IEEE Softw 28(3):84–90. https://doi.org/10.1109/MS.2010.88
Lemos G S, Martins E (2012) Specification-guided golden run for analysis of robustness testing results. In: 2012 IEEE Sixth international conference on software security and reliability. https://doi.org/10.1109/SERE.2012.28, pp 157–166
Miller G (2006) A scientist’s nightmare: software problem leads to five retractions. Science 314(5807):1856–1857. https://doi.org/10.1126/science.314.5807.1856
Murphy C, Raunak M, King A, Chen S, Imbriano C, Kaiser G, Lee I, Sokolsky O, Clarke L, Osterweil L (2011) On effective testing of health care simulation software. Technical Reports (CIS). https://doi.org/10.1145/1987993.1988003
Nguyen-Hoan L, Flint S, Sankaranarayana R (2010) A survey of scientific software development. In: Proceedings of the 2010 ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’10. https://doi.org/10.1145/1852786.1852802. ACM, pp 12:1–12:10
Post D E, Kendall R P (2004) Software project management and quality engineering practices for complex, coupled multiphysics, massively parallel computational simulations: lessons learned from ASCI. Int J High Perform Comput Appl 18(4):399–416. https://doi.org/10.1177/1094342004048534
Sanders R, Kelly D (2008) Dealing with risk in scientific software development. IEEE Softw 25(4):21–28. https://doi.org/10.1109/MS.2008.84
Segal J (2005) When software engineers met research scientists: a case study. Empir Softw Eng 10. https://doi.org/10.1007/s10664-005-3865-y
Segal J (2009) Software development cultures and cooperation problems: a field study of the early stages of development of software for a scientific community. Comput Supported Coop Work 18(5):581. https://doi.org/10.1007/s10606-009-9096-9
Vilkomir S A, Swain W T, Poore J H, Clarno K T (2008) Modeling input space for testing scientific computational software: a case study. In: Bubak M, van Albada GD, Dongarra J, Sloot PMA (eds) Computational science—ICCS 2008. Springer, Berlin, pp 291–300
Acknowledgements
We thank the study participants. This work was supported in part by NSF award 1445344.
Ethics declarations
Conflict of Interest
None
Additional information
Communicated by: Dietmar Pfahl
Appendices
Appendix A: Definitions Provided
We refer to Figs. 1 and 2 for the survey question. In this section, we list the definitions we provided in the actual survey; a short illustrative code sketch follows the list.
- Acceptance testing—Assess software with respect to requirements or users’ needs.
- Architect—An individual who is a software development expert, makes high-level design choices, and dictates technical standards, including software coding standards, tools, and platforms.
- Assertion checking—Testing a necessary property of the program under test by verifying a boolean expression or constraint.
- Backward compatibility testing—Testing whether the newly updated software works well with an older version of the environment.
- Branch coverage—Measuring code coverage by making sure all branches in the program source code are tested at least once.
- Boundary value analysis—Testing the output by checking whether defects exist at boundary values.
- Condition coverage—Measuring code coverage by making sure all conditions in the program source code are tested at least once.
- Decision table based testing—Testing the output by exercising different combinations of inputs that produce different results.
- Developer—An individual who writes, debugs, and executes the source code of a software application.
- Dual coding—Testing models created using two different algorithms while using the same or a common set of features.
- Equivalence partitioning—Testing a group of inputs by picking a few representative values, on the assumption that all values in the group generate the same output.
- Error guessing—Testing the output where the test analyst uses his or her experience to guess the problematic areas of the application.
- Executive—An individual who establishes and directs the strategic long-term goals, policies, and procedures for an organization’s software development program.
- Fuzzing test—Testing the software for failures or error messages triggered by unexpected or random inputs.
- Graph coverage—Measuring code coverage by mapping executable statements and branches to a control flow graph and covering the graph in some defined way.
- Input space partitioning—Testing the output by dividing the input space of the software under test into logical partitions and choosing elements from each partition.
- Integration testing—Assess software with respect to subsystem design.
- Logic coverage—Testing both the semantic and syntactic meaning of how a logical expression is formulated.
- Maintainer—An individual who builds source code into binary packages for distribution, commits patches, or organizes code in a source repository.
- Manager—An individual who is responsible for overseeing and coordinating the people, resources, and processes required to deliver new software or upgrade existing products.
- Metamorphic testing—Testing how a particular change in the input of the program changes the output.
- Module testing—Assess software with respect to detailed design.
- Monte Carlo test—Testing numerical results using repeated random sampling.
- Performance testing—Testing non-functional quality attributes of software such as stability, reliability, and availability.
- Quality Assurance Engineer—An individual who tracks the development process, oversees production, and tests each part to ensure it meets standards before moving to the next phase.
- State transition—Testing the outputs produced by changes to the input conditions or changes to the ‘state’ of the system.
- Statement coverage—Measuring code coverage by making sure all statements in the program source code are tested at least once.
- Syntax-based testing—Testing the output by using syntax to generate artifacts that are valid or invalid.
- System testing—Assess software with respect to architectural design and overall behavior.
- Test driven development—Testing the output by writing an (initially failing) automated test case that defines a desired improvement or new function, then producing the minimum amount of code to pass that test.
- Unit testing—Assess software with respect to implementation.
- Using machine learning—Testing the output values using different machine learning techniques.
- Using statistical tests—Testing the output values using different statistical tests.
Appendix B: List of Testing Techniques
This appendix provides the list of testing techniques respondents mentioned they were familiar with in response to survey question Q9. The numbers in parentheses represent how many respondents indicated each testing technique.
B.1 Testing Methods
Acceptance testing (9), Integration testing (43), System testing (14), Unit testing (87)
B.2 Testing Techniques
A/B testing (1), Accuracy testing (1), Alpha testing (1), Approval testing (2), Answer testing (1), Assertions testing (3), Behavioral testing (1), Beta testing (1), Bit-for-bit (1), Black-box testing (4), Built environment testing (1), Build testing (1), Checklist testing (1), Checksum (1), Compatibility testing (1), Concolic testing (1), Correctness tests (1), Dependencies testing (1), Deployment testing (1), Dynamic testing (3), End-to-end testing (2), Equivalence class (1), Engineering tests (1), Exploratory tests (1), Functional testing (6), Fuzz testing (12), Golden master testing (1), Install testing (1), Jenkins automated testing (1), Load testing (1), Manual testing (2), Memory testing (6), Mock testing (6), Mutation testing (5), Penetration testing (1), Performance testing (6), Periodic testing (1), Physics testing (1), Property-based testing (2), Random input testing (2), Reference runs on test datasets (1), Regression testing (39), Reliability testing (1), Resolution testing (1), Scientific testing (2), Security testing (1), Smoke test (2), Statistical testing (1), Stress test (1), Usability testing (1), Use case test (1), User testing (2), Validation testing (9), White-box testing (2)
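Two of the techniques tallied above, property-based testing and random-input (fuzz-style) testing, pair naturally: a property-based framework generates many random inputs and checks an invariant over each. The sketch below uses Python's `hypothesis` library; the `normalize` function and its property are our own illustrative assumptions, not drawn from the survey responses.

```python
# A minimal sketch of property-based testing, assuming a hypothetical
# `normalize` routine; it uses the hypothesis library (pip install
# hypothesis) and is not drawn from the survey responses.
import math
from hypothesis import given, strategies as st

def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return vec if norm == 0.0 else [v / norm for v in vec]

@given(st.lists(st.floats(min_value=-1e6, max_value=1e6), min_size=1))
def test_normalize_unit_length_or_zero(vec):
    # Property: the result has length ~1, or the (effective) input
    # length was zero. hypothesis generates many random inputs, which
    # also acts as a simple form of random-input (fuzz-style) testing.
    norm = math.sqrt(sum(v * v for v in normalize(vec)))
    assert norm == 0.0 or math.isclose(norm, 1.0, rel_tol=1e-9)
```

Running the file under pytest executes the property over many generated inputs; a failing case is automatically shrunk to a minimal counterexample.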
B.3 Testing Tools
CTest (3), gtest (1), JUnit (1)
B.4 Other types of QA
Code coverage (16), Code reviews (2), Documentation checking (3), Static analysis (6)
B.5 Others
Agile (1), Asan (1), Automatic test-case generation (1), Behavior-Driven Development (1), Bamboo (1), Benchmarking (1), Caliper (1), Code style checking (1), Coding standards (1), Comparison with analytical solutions (1), Continuous integration (33), Contracts (1), DBC (1), Design by contract (1), Doctests (1), Formal Methods (2), GitLab (1), License compliance (1), Linting (1), Method of exact solutions (1), Method of manufactured solutions (2), Monitoring production apps (1), Msan (1), N-version (1), Nightly (1), Pre-commit (1), Profiling (1), Release (1), Run-time instrumentation and logging (1), Squish (1), Test-driven development (18), Test suites (1), Tsan (1), Visual Studio (2)
Cite this article
Eisty, N.U., Carver, J.C. Testing research software: a survey. Empir Software Eng 27, 138 (2022). https://doi.org/10.1007/s10664-022-10184-9