Abstract
Background
Research software plays an important role in solving real-life problems, empowering scientific innovations, and handling emergency situations. Therefore, the correctness and trustworthiness of research software are of paramount importance. Software testing is an important activity for identifying problematic code and helping to produce high-quality software. However, testing research software is difficult due to the complexity of the underlying science, the frequent absence of known correct results for scientific algorithms, and the culture of the research software community.
Aims
The goal of this paper is to better understand current testing practices, identify challenges, and provide recommendations on how to improve the testing process for research software development.
Method
We surveyed members of the research software developer community to collect information regarding their knowledge about and use of software testing in their projects.
Results
We analysed 120 responses and found that even though research software developers report an average level of knowledge about software testing, they still find testing difficult due to the numerous challenges involved. However, there are a number of remedies, such as proper training, that can improve the testing process for research software.
Conclusions
Testing can be challenging for any type of software. This difficulty is especially present in the development of research software, where software engineering activities are typically given less attention. To produce trustworthy results from research software, there is a need for a culture change so that testing is valued and teams devote appropriate effort to writing and executing tests.
Notes
The survey data is available in a public repository but set to private until publication of this paper (Carver and Eisty 2021).
References
Ackroyd K S, Kinder S H, Mant G R, Miller M C, Ramsdale C A, Stephenson P C (2008) Scientific software development at a research facility. IEEE Softw 25(4):44–51. https://doi.org/10.1109/MS.2008.93
Ammann P, Offutt J (2016) Introduction to software testing, 2nd edn. Cambridge University Press, Cambridge
Bourque P, Fairley RE (eds) (2014) SWEBOK: guide to the software engineering body of knowledge, version 3.0. IEEE Computer Society, Los Alamitos. http://www.swebok.org/
Carver JC, Eisty N (2021) Testing research software: survey data. https://doi.org/10.6084/m9.figshare.16663561
Clune T, Rood R (2011) Software testing and verification in climate model development. IEEE Softw 28(6):49–55. https://doi.org/10.1109/MS.2011.117
Drake J B, Jones P W, Carr Jr G R (2005) Overview of the software design of the community climate system model. Int J High Perform Comput Appl 19(3):177–186. https://doi.org/10.1177/1094342005056094
Easterbrook S M (2010) Climate change: a grand software challenge. In: Proceedings of the FSE/SDP workshop on future of software engineering research, FoSER ’10. https://doi.org/10.1145/1882362.1882383. ACM, pp 99–104
Easterbrook S M, Johns T C (2009) Engineering the software for understanding climate change. Comput Sci Eng 11(6):65–74. https://doi.org/10.1109/MCSE.2009.193
Eddins S L (2009) Automated software testing for MATLAB. Comput Sci Eng 11(6):48–55. https://doi.org/10.1109/MCSE.2009.186
Eisty N U, Thiruvathukal G K, Carver J C (2018) A survey of software metric use in research software development. In: 2018 IEEE 14th international conference on e-science (e-science). https://doi.org/10.1109/eScience.2018.00036, pp 212–222
Farrell P E, Piggott M D, Gorman G J, Ham D A, Wilson C R, Bond T M (2011) Automated continuous verification for numerical simulation. Geosci Model Dev 4(2):435–449. https://doi.org/10.5194/gmd-4-435-2011
Hannay J E, MacLeod C, Singer J, Langtangen H P, Pfahl D, Wilson G (2009) How do scientists develop and use scientific software?. In: 2009 ICSE workshop on software engineering for computational science and engineering. https://doi.org/10.1109/SECSE.2009.5069155, pp 1–8
Heaton D, Carver J C (2015) Claims about the use of software engineering practices in science: a systematic literature review. Inf Softw Technol 67:207–219. https://doi.org/10.1016/j.infsof.2015.07.011
Hill C (2016) Socio-economic status and computer use: designing software that supports low-income users. In: 2016 IEEE Symposium on visual languages and human-centric computing. https://doi.org/10.1109/VLHCC.2016.7739651, pp 1–1
Hochstein L, Basili V R (2008) The ASC-Alliance projects: a case study of large-scale parallel scientific code development. Computer 41(3):50–58. https://doi.org/10.1109/MC.2008.101
Hook D, Kelly D (2009) Testing for trustworthiness in scientific software. In: 2009 ICSE workshop on software engineering for computational science and engineering. https://doi.org/10.1109/SECSE.2009.5069163, pp 59–64
Kanewala U, Bieman J M (2014) Testing scientific software: a systematic literature review. Inf Softw Technol 56(10):1219–1232. https://doi.org/10.1016/j.infsof.2014.05.006
Katz D S, McInnes L C, Bernholdt D E, Mayes A C, Hong N P C, Duckles J, Gesing S, Heroux M A, Hettrick S, Jimenez R C, Pierce M, Weaver B, Wilkins-Diehr N (2019) Community organizations: changing the culture in which research software is developed and sustained. Comput Sci Eng 21(2):8–24. https://doi.org/10.1109/MCSE.2018.2883051
Kelly D, Sanders R (2008) The challenge of testing scientific software. In: Proceedings of the conference for the association for software testing, pp 30–36
Kelly D, Hook D, Sanders R (2009) Five recommended practices for computational scientists who write software. Comput Sci Eng 11(5):48–53. https://doi.org/10.1109/MCSE.2009.139
Kelly D, Thorsteinson S, Hook D (2011) Scientific software testing: analysis with four dimensions. IEEE Softw 28(3):84–90. https://doi.org/10.1109/MS.2010.88
Lemos G S, Martins E (2012) Specification-guided golden run for analysis of robustness testing results. In: 2012 IEEE Sixth international conference on software security and reliability. https://doi.org/10.1109/SERE.2012.28, pp 157–166
Miller G (2006) A scientist’s nightmare: software problem leads to five retractions. Science 314(5807):1856–1857. https://doi.org/10.1126/science.314.5807.1856
Murphy C, Raunak M, King A, Chen S, Imbriano C, Kaiser G, Lee I, Sokolsky O, Clarke L, Osterweil L (2011) On effective testing of health care simulation software. Technical Reports (CIS). https://doi.org/10.1145/1987993.1988003
Nguyen-Hoan L, Flint S, Sankaranarayana R (2010) A survey of scientific software development. In: Proceedings of the 2010 ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’10. https://doi.org/10.1145/1852786.1852802. ACM, pp 12:1–12:10
Post D E, Kendall R P (2004) Software project management and quality engineering practices for complex, coupled multiphysics, massively parallel computational simulations: lessons learned from ASCI. Int J High Perform Comput Appl 18(4):399–416. https://doi.org/10.1177/1094342004048534
Sanders R, Kelly D (2008) Dealing with risk in scientific software development. IEEE Softw 25(4):21–28. https://doi.org/10.1109/MS.2008.84
Segal J (2005) When software engineers met research scientists: a case study. Empir Softw Eng 10. https://doi.org/10.1007/s10664-005-3865-y
Segal J (2009) Software development cultures and cooperation problems: a field study of the early stages of development of software for a scientific community. Comput Supported Coop Work 18(5):581. https://doi.org/10.1007/s10606-009-9096-9
Vilkomir S A, Swain W T, Poore J H, Clarno K T (2008) Modeling input space for testing scientific computational software: a case study. In: Bubak M, van Albada GD, Dongarra J, Sloot PMA (eds) Computational science—ICCS 2008. Springer, Berlin, pp 291–300
Acknowledgements
We thank the study participants. This work was supported in part by NSF award 1445344.
Ethics declarations
Conflict of Interest
None
Additional information
Communicated by: Dietmar Pfahl
Appendices
Appendix A: Definitions Provided
We refer to Figs. 1 and 2 for the survey question. In this section, we list the definitions we provided in the actual survey; a short illustrative code sketch follows the list.
- Acceptance testing—Assess software with respect to requirements or users’ needs.
- Architect—An individual who is a software development expert, makes high-level design choices, and dictates technical standards, including software coding standards, tools, and platforms.
- Assertion checking—Testing a necessary property of the program under test by verifying a boolean expression or constraint.
- Backward compatibility testing—Testing whether the newly updated software works well with an older version of the environment.
- Branch coverage—Measuring code coverage by making sure all branches in the program source code are tested at least once.
- Boundary value analysis—Testing the output by checking whether defects exist at boundary values.
- Condition coverage—Measuring code coverage by making sure all conditions in the program source code are tested at least once.
- Decision table based testing—Testing the output by exercising different combinations of inputs that produce different results.
- Developer—An individual who writes, debugs, and executes the source code of a software application.
- Dual coding—Testing models created using two different algorithms while using the same or a common set of features.
- Equivalence partitioning—Testing a group of inputs by picking a few representative values, on the assumption that all values in the group generate the same output.
- Error guessing—Testing the output where the test analyst uses his or her experience to guess the problematic areas of the application.
- Executive—An individual who establishes and directs the strategic long-term goals, policies, and procedures for an organization’s software development program.
- Fuzzing test—Testing the software for failures or error messages triggered by unexpected or random inputs.
- Graph coverage—Measuring code coverage by mapping executable statements and branches to a control flow graph and covering the graph in some defined way.
- Input space partitioning—Testing the output by dividing the input space of the software under test into logical partitions and choosing elements from each partition.
- Integration testing—Assess software with respect to subsystem design.
- Logic coverage—Testing both the semantic and syntactic meaning of how a logical expression is formulated.
- Maintainer—An individual who builds source code into binary packages for distribution, commits patches, or organizes code in a source repository.
- Manager—An individual who is responsible for overseeing and coordinating the people, resources, and processes required to deliver new software or upgrade existing products.
- Metamorphic testing—Testing how a particular change in the input of the program changes the output.
- Module testing—Assess software with respect to detailed design.
- Monte Carlo test—Testing numerical results using repeated random sampling.
- Performance testing—Testing non-functional quality attributes of software such as stability, reliability, and availability.
- Quality Assurance Engineer—An individual who tracks the development process, oversees production, and tests each part to ensure it meets standards before moving to the next phase.
- State transition—Testing the outputs produced by changes to the input conditions or changes to the ‘state’ of the system.
- Statement coverage—Measuring code coverage by making sure all statements in the program source code are tested at least once.
- Syntax-based testing—Testing the output by using syntax to generate artifacts that are valid or invalid.
- System testing—Assess software with respect to architectural design and overall behavior.
- Test driven development—Testing the output by writing an (initially failing) automated test case that defines a desired improvement or new function, then producing the minimum amount of code to pass that test.
- Unit testing—Assess software with respect to implementation.
- Using machine learning—Testing the output values using different machine learning techniques.
- Using statistical tests—Testing the output values using different statistical tests.
Appendix B: List of Testing Techniques
This appendix provides the list of testing techniques respondents mentioned they were familiar with in response to survey question Q9. The numbers in parentheses represent how many respondents indicated each testing technique.
B.1 Testing Methods
Acceptance testing (9), Integration testing (43), System testing (14), Unit testing (87)
B.2 Testing Techniques
A/B testing (1), Accuracy testing (1), Alpha testing (1), Approval testing (2), Answer testing (1), Assertions testing (3), Behavioral testing (1), Beta testing (1), Bit-for-bit (1), Black-box testing (4), Built environment testing (1), Build testing (1), Checklist testing (1), Checksum (1), Compatibility testing (1), Concolic testing (1), Correctness tests (1), Dependencies testing (1), Deployment testing (1), Dynamic testing (3), End-to-end testing (2), Equivalence class (1), Engineering tests (1), Exploratory tests (1), Functional testing (6), Fuzz testing (12), Golden master testing (1), Install testing (1), Jenkins automated testing (1), Load testing (1), Manual testing (2), Memory testing (6), Mock testing (6), Mutation testing (5), Penetration testing (1), Performance testing (6), Periodic testing (1), Physics testing (1), Property-based testing (2), Random input testing (2), Reference runs on test datasets (1), Regression testing (39), Reliability testing (1), Resolution testing (1), Scientific testing (2), Security testing (1), Smoke test (2), Statistical testing (1), Stress test (1), Usability testing (1), Use case test (1), User testing (2), Validation testing (9), White-box testing (2)
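Two of the techniques tallied above, property-based testing and random-input (fuzz-style) testing, pair naturally: a property-based framework generates many random inputs and checks an invariant over each. The sketch below uses Python's `hypothesis` library; the `normalize` function and its property are our own illustrative assumptions, not drawn from the survey responses.

```python
# A minimal sketch of property-based testing, assuming a hypothetical
# `normalize` routine; it uses the hypothesis library (pip install
# hypothesis) and is not drawn from the survey responses.
import math
from hypothesis import given, strategies as st

def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return vec if norm == 0.0 else [v / norm for v in vec]

@given(st.lists(st.floats(min_value=-1e6, max_value=1e6), min_size=1))
def test_normalize_unit_length_or_zero(vec):
    # Property: the result has length ~1, or the (effective) input
    # length was zero. hypothesis generates many random inputs, which
    # also acts as a simple form of random-input (fuzz-style) testing.
    norm = math.sqrt(sum(v * v for v in normalize(vec)))
    assert norm == 0.0 or math.isclose(norm, 1.0, rel_tol=1e-9)
```

Running the file under pytest executes the property over many generated inputs; a failing case is automatically shrunk to a minimal counterexample.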
B.3 Testing Tools
CTest (3), gtest (1), JUnit (1)
B.4 Other types of QA
Code coverage (16), Code reviews (2), Documentation checking (3), Static analysis (6)
B.5 Others
Agile (1), Asan (1), Automatic test-case generation (1), Behavior-Driven Development (1), Bamboo (1), Benchmarking (1), Caliper (1), Code style checking (1), Coding standards (1), Comparison with analytical solutions (1), Continuous integration (33), Contracts (1), DBC (1), Design by contract (1), Doctests (1), Formal Methods (2), GitLab (1), License compliance (1), Linting (1), Method of exact solutions (1), Method of manufactured solutions (2), Monitoring production apps (1), Msan (1), N-version (1), Nightly (1), Pre-commit (1), Profiling (1), Release (1), Run-time instrumentation and logging (1), Squish (1), Test-driven development (18), Test suites (1), Tsan (1), Visual Studio (2)
Cite this article
Eisty, N.U., Carver, J.C. Testing research software: a survey. Empir Software Eng 27, 138 (2022). https://doi.org/10.1007/s10664-022-10184-9