
Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review

Published: 30 March 2023

Abstract

We perform a systematic literature review on testing, validation, and verification of robotic and autonomous systems (RAS). The scope of this review covers peer-reviewed research papers proposing, improving, or evaluating testing techniques, processes, or tools that address the system-level qualities of RAS.
Our survey is performed based on a rigorous methodology structured in three phases. First, we made use of a set of 26 seed papers (selected by domain experts) and the SERP-TEST taxonomy to design our search query and (domain-specific) taxonomy. Second, we conducted a search in three academic search engines and applied our inclusion and exclusion criteria to the results. Third, we made use of related work and domain specialists (50 academics and 15 industry experts) to validate and refine the search query. As a result, we encountered 10,735 studies, out of which 195 were included, reviewed, and coded.
Our objective is to answer four research questions, pertaining to (1) the type of models, (2) measures for system performance and testing adequacy, (3) tools and their availability, and (4) evidence of applicability, particularly in industrial contexts. We analyse the results of our coding to identify strengths and gaps in the domain and present recommendations to researchers and practitioners.
Our findings show that variants of temporal logics are most widely used for modelling requirements and properties, while variants of state-machines and transition systems are used widely for modelling system behaviour. Other common models concern epistemic logics for specifying requirements and belief-desire-intention models for specifying system behaviour. Apart from time and epistemics, other aspects captured in models concern probabilities (e.g., for modelling uncertainty) and continuous trajectories (e.g., for modelling vehicle dynamics and kinematics).
Many papers lack any rigorous measure of efficiency, effectiveness, or adequacy for their proposed techniques, processes, or tools. Among those that provide such a measure, the majority use domain-agnostic generic measures such as number of failures, size of state-space, or verification time. There is a trend towards addressing this research gap by developing domain-specific notions of performance and adequacy. Defining widely accepted rigorous measures of performance and adequacy for each domain remains an identified research gap.
In terms of tools, the most widely used tools are well-established model-checkers such as Prism and Uppaal, as well as simulation tools such as Gazebo; Matlab/Simulink is another widely used toolset in this domain.
Overall, there is very limited evidence of industrial applicability in the papers published in this domain. There is also a gap concerning consolidated benchmarks for various types of autonomous systems.

1 Introduction

1.1 Motivation

Robotic and Autonomous Systems (RAS) involve a rich integration of several disciplines such as control engineering and robotics, mechanical engineering, electronics, and software engineering. Validation and verification of RAS entails a non-trivial extension of traditional testing techniques to deal with their multi-disciplinary nature. In particular, for researchers and practitioners from the software testing community, extending the existing software testing techniques to RAS is a challenge that has led to a sizeable literature on proposing and evaluating different techniques and processes. This rich literature calls for a secondary study that brings structure to this landscape and identifies the relative strengths and weaknesses of available results. The present article addresses this gap by performing a structured literature survey of RAS testing.
There are a number of earlier surveys on related topics; we provide an in-depth comparison of related work with our survey in Section 2. Briefly, some of these surveys have a different or more confined scope, e.g., considering machine learning components [79], formal specification and verification techniques [138], or driving datasets [108], or do not aim to provide a structured overview of the field to answer concrete questions for a given audience [20]. To our knowledge, this is the first systematic secondary study that covers the breadth of results in testing RAS (see the Related Work section for other studies with different foci) and, moreover, provides an analysis of such results with the aim of characterising the types of techniques and processes and analysing their evidence of applicability (in terms of tools and types of case study).

1.2 Scope and Audience

Our scope covers novel results (including techniques, processes, tools, and applications thereof) that deal with testing robotic and autonomous systems. We call such novel results “interventions,” following the tradition in medical secondary studies, as well as recent systematic reviews in testing [3]. In our terminology, an intervention is “an act performed (e.g., use of a technique or a process change) to adapt testing to a specific context, to solve a test issue, to diagnose testing, or to improve testing” [68]. The scope of our survey includes several validation and verification techniques, including physical testing, model-based testing, runtime monitoring, formal verification, and model checking.
Our audience are both researchers and practitioners in software and systems engineering. Hence, we perform our analysis from two perspectives:
(1)
researchers: to identify strengths and gaps in the research landscape of testing RAS, particularly with respect to traditional software testing taxonomies (are there new challenges not covered by these taxonomies?); and
(2)
practitioners: to identify interventions that have evidence of applicability given the environment and available resources.
We provide a precise definition of RAS in the remainder of this article to derive rigorous inclusion and exclusion criteria. But in a nutshell, for our interventions to be useful for the intended audience, we confine our scope to those interventions that
(1)
address testing the computer systems integrated in RAS (as opposed to only physical, mechanical, or control parts) in their methodology; this is justified by the fact that our intended audience are researchers and practitioners in software and systems engineering,
(2)
have some evidence of applicability, efficiency, or effectiveness on RAS; this is motivated by our scope (testing, validation, and verification of RAS) as well as our goal to provide evidence of strength (or weakness) for researchers and practitioners, and
(3)
take the system-level validation and verification into account and do not focus on a specific unit or component of such systems (e.g., a specific type of learning or planning algorithm, or testing of physical or mechanical parts of such systems); this is motivated by the inherent multi-disciplinary nature of RAS and the need to accommodate it in any system-level testing of RAS.
Next, we define a number of research questions that help us structure and analyse the existing interventions for the two groups of audience.

1.3 Research Questions

As specified above, we would like to review and analyse those interventions that are applicable to testing, validation, and verification of RAS; in particular, we place an emphasis on those interventions that take into account the computer systems in RAS and their interactions with their physical environment and human users. In the remainder of this section and throughout the rest of the article, we use the term testing to refer to various testing, validation, and verification techniques.
A structured method of testing, validation, or verification is often steered by models, describing the structure or the behaviour of the system under test. The type of models often determines the type of analysis that can be applied and hence, has a far-reaching effect on the applicability and effectiveness of the technique. However, not all included interventions are model-based (or even related to test cases), as we also consider other forms of verification such as runtime monitoring. Moreover, the metrics of effectiveness, efficiency, and coverage used to evaluate the system under test and the intervention itself are both a major factor in determining the intervention’s applicability and, hence, form a major part of our research questions. Finally, the case studies performed to evaluate the technique are a major source of evidence for applicability. Based on these observations, our research questions are specified below:
(1)
What are the types of models used for testing RAS?
We interpret the word “model” liberally as any information source or domain abstraction that is used to structure or steer the testing process or evaluate the outcome of testing. This helps us understand and decide about the types of abstractions that are commonly used or needed for testing RAS. It helps both researchers and practitioners identify the types/aspects of RAS that can be addressed using the current testing interventions and also the types of information that need to be made available for these interventions to be applicable. It also points to aspects of RAS that are not covered by the current interventions. In line with the goals specified above, we analyse two types of models: those that address the system under test or its environment, versus models that describe its quality attributes.
(2)
Which efficiency, effectiveness, and coverage measures were introduced or used to evaluate RAS testing interventions?
Efficiency refers to the amount of time and resources needed for an intervention to achieve its goal. Effectiveness refers to the type and the number of faults uncovered by a testing intervention, and coverage refers to any measure that is used to decide the adequacy and the stopping criteria for a testing intervention. Answering this question also provides researchers and practitioners with the available evidence for the strength and applicability of the existing techniques, processes, and tools.
(3)
What are the interventions supported by (publicly available) tools in this domain?
Tool support is a key enabler of the application of testing interventions in practice and of their integration with other interventions in research contexts. We analyse the literature regarding this research question by providing information about the tooling available for and needed by each intervention. We call the first class of tools, i.e., those developed to support a particular intervention, effect tools; the second class comprises the tools used and needed for the effect tools to function. This second category, called context tools, provides further information about what is needed for a particular intervention to be automated in its context. We also report the licence information, when available, to facilitate decision-making.
(4)
Which interventions have evidence of applicability to large-scale and industrial systems?
We gather evidence from the reviewed interventions in terms of case studies and classify them into small-scale, benchmarks, and industrial case studies.

1.4 Structure of the Article

The remainder of this article is structured as follows: In Section 2, we review related work, with a focus on secondary studies (literature surveys and reviews) on related subject matters. In Section 3, we define the scope of the article and explain the background to this structured review; there, we report on the core set of results we started with as the seed for our search. In Section 4, we review the methodology we used for our systematic review; this includes the description of our search and selection strategy, the development of the taxonomy used for coding the results, and our data extraction and synthesis methods. In this section, we also reflect on the threats to the validity of our study. In Section 5, we present the results of our coding and analyse them to answer our research questions. In Section 6, we reflect on our analysis and provide concrete suggestions for our target audience, i.e., both researchers and practitioners. In Section 7, we conclude the article and present some directions for future research.

2 Related Work

A number of literature reviews, surveys, and mapping studies cover different aspects of robotic and autonomous systems. In what follows, we give an overview of those most closely related to our study (in chronological order).
Cortesi, Ferrara, and Chaki [55] discuss the features of a number of analysis techniques, namely, data-flow analysis, control-flow analysis, model-checking, and abstract interpretation. The survey covers features such as automation, precision, scalability, and soundness for these techniques. The stated goal of the study is to give robotics software developers hints to help them choose appropriate analysis approaches, depending on the kind of properties of interest and the software system. However, the interventions studied in this article are not necessarily applied in the robotics domain already. Furthermore, the work is not a systematic review and does not claim to provide any coverage of existing work on analysis techniques applied in its target application domain.
Helle, Schamai, and Strobel [99] as well as Redfield and Seto [181] provide an overview of challenges in and available techniques and results for testing and verification of autonomous systems. Both studies only sample a small subset of available results and techniques and use them to identify the areas requiring future research. Our findings, based on a much larger set, provide a much more refined view about the available interventions and the landscape for future research.
Koopman and Wagner [118] give an overview of challenges in the V model adapted to deal with problems in the context of autonomous vehicles. The paper identifies five major challenge areas in testing according to the V model for autonomous vehicles, namely, driver out of the loop, complex requirements, non-deterministic algorithms, inductive learning algorithms, and fail-operational systems. The paper covers solution approaches that seem promising across these different challenges, including phased deployment using successively relaxed operational scenarios, a monitor/actuator pair architecture to separate complex automated and autonomous functions from simpler safety functions, and fault injection. Similar to the previous two papers, the work of Koopman and Wagner has a more restrictive scope than the present article; moreover, the above-mentioned work is not a (structured) review of the literature.
Gao and Tan [79] provide an overview of the state-of-the-art in V&V for safety-critical systems that rely on machine learning techniques (based on deep learning) for autonomous driving. In this work, the researchers first extract a set of studies by conducting a search and identify a set of challenges by reviewing these studies. The validity of the identified challenges is then checked through an industrial questionnaire survey. Furthermore, a set of research recommendations is provided for future work on automated driving based on deep learning. The search query used in this study is more limited in scope than ours, because it focuses on testing for automated driving and deep learning, while we cover robotic and autonomous systems in a much broader sense. The articles covered in this study were published before 2017.
Knauss et al. [114] present an empirical study for investigating software-related challenges of testing automated vehicles. In the work, two different kinds of data collection, namely, focus groups (including 11 practitioners from Sweden) and interviews (including 15 practitioners and researchers from a number of countries), are used. The work provides insights about challenges such as virtual testing and simulation, standards and certifications, increased need to test nonfunctional aspects, and automation. This work is not a systematic mapping.
Rao and Frtunikj [180] identify three concrete issues regarding assessment of functional safety of neural networks used in automotive industry to initiate the discussion with industrial peers to find practical solutions. The issues include: dataset completeness, neural network implementation, and transfer learning.
Kang, Yin, and Berger [108] provide a survey of publicly available driving datasets as well as virtual testing for autonomous driving algorithms. A detailed overview of 37 datasets for open-loop testing and 22 virtual testing environments for closed-loop testing is provided. A remarkable aspect of this survey is the involvement of an industrial domain expert. The scope and results of the paper are significantly different from ours: They focus on autonomous driving algorithms, while we include the whole domain of RAS; they focus on datasets and tools, while we focus on interventions and their effects, as well as their tools.
Beglerovic, Metzner, and Horn [20] provide a brief overview of methodologies used for testing in automated driving. The work provides recommendations about promising methodologies and research areas aimed at reducing the testing effort. The authors mention challenges such as the complexity of automated driving functions, variation of scenarios and parameters, scenario selection, and test generation. Furthermore, the work briefly touches upon validation, supporting tools in the validation task, and standardisation. This paper is significantly different in methodology from ours: It is not a mapping study and does not provide any detail about the coverage of existing work.
Luckcuck et al. [138] provide a survey of formal specification and verification methods and tools used for autonomous robotic systems. The work covers a range of studies from 2007–2018. Their work identifies a number of challenges for formally modelling and verifying both the internals of such systems and the environments in which they operate. Their work differs from ours in that it only covers formal specification and verification tools for such systems; hence, techniques such as (non-exhaustive) testing and simulation are not covered. Also, our work has a different methodological approach in that we pose and answer research questions as the result of our secondary study, while they focus on the literature review itself. We did use the studies reviewed by Luckcuck et al. to validate and refine our search query in the third phase of our research.
Gleirscher, Foster, and Woodcock [89] provide an overview of the strengths, weaknesses, opportunities, and threats in the application of integrated Formal Methods to robotic and autonomous systems. Some of their findings, such as the gaps concerning evidence of effectiveness and tool support, reinforce our findings, and some, such as the challenges in training, are complementary to those of the present article. We believe some of the complementary findings arise from the general experience and findings about the application of formal methods, which is broader than the scope of a survey in the domain of robotic and autonomous systems.
Tahir and Alexander [207] perform a systematic literature review on coverage-based validation, verification, and safety assurance techniques for autonomous vehicles. The scope of their survey is much more confined than ours. They do code different coverage criteria in answer to one of their research questions, which overlaps with our goal of identifying coverage criteria. We have used their included papers to validate our search query as part of the third phase of our methodology.
Rajabali et al. [178] perform an extensive and systematic literature review on software validation and verification for autonomous vehicles. Their scope is more restricted than the scope of the present study, but some of their research questions (such as identifying gaps in the literature) are common to ours. However, their methodology does not involve a detailed taxonomy as in the present study and, hence, their conclusions are more abstract and at a higher level. We have also used this recent paper to validate the query and the final set of considered papers in the third phase of our research.

3 Background and Rationale

In this section, we provide an overview of the motivation behind this literature survey and define its domain. Subsequently, we introduce the basic taxonomy that we have extended and adapted for coding the literature. We also review the pilot study that was used to shape our taxonomy (and later validate our search query, presented in the next section).

3.1 Motivation

Based on our study of the existing literature reviews and surveys, we identified the gap for a secondary study that (1) presents a structured review of the existing results on validation and verification of robotic and autonomous systems and (2) targets specific research questions regarding (a) the types of models, (b) measures of efficiency and effectiveness, (c) available tools, and (d) evidence of applicability to large-scale and industrial systems.

3.2 Robotic and Autonomous System

There is a variety of definitions for our domain, RAS; these definitions encompass aspects such as autonomy (including high-level decision-making and planning) and adaptation (including artificial intelligence and machine learning) and interaction with human users and the physical environment (including perception, actuation, and mobility). In our view, the following definition provides a concise synthesis of these aspects:
An autonomous system is an intelligent system that is designed to deal with the physical environment on its own and work for extended periods of time without explicit human intervention. Such systems are built to analyse, learn from, adapt to, and act on the surrounding environment.
This definition is inspired by and merges some complementary aspects in the earlier definitions given by the Royal Academy of Engineering [164] and the National Science Foundation [76]. We emphasise two important aspects of this definition: one is the system-level perspective; hence, modules or units of software and hardware that are not autonomous systems themselves will not be included in our studies; the second important aspect is the interaction with the environment; hence, autonomous systems that work on offline data and do not feature an interaction with their environment are excluded as well.

3.3 Testing and the SERP-Test Taxonomy

In this work, we consider a testing intervention as any structured approach to validate or verify the quality of a robotic and autonomous system. Validation concerns checking the system specification, design, or implementation against user requirements. Verification concerns checking the system specification, design, or implementation against another piece of specification, design, or implementation. In other words, validation checks whether we have built the right system (for its users), while verification checks whether we have built it correctly (with respect to other specifications and artefacts) [173].
Our classification of testing research is based on the SERP-Test taxonomy [68]. This taxonomy provides a very general framework for classifying and communicating software testing research and has been used and adapted for this purpose across different domains [3, 183]. It serves as a useful tool for researchers and practitioners to select a testing process or technique based on the available resources or the expected evidence of applicability, effectiveness, and efficiency. In SERP-Test, testing research is classified in terms of four facets: intervention, effect, scope, and context. Intervention pertains to test techniques and their adaptation and adoption in different contexts. The effect facet is used to identify the improvement or adaptation in a given practice as well as any insights gained through assessment. The scope specifies whether the effect has been materialised in the planning, design, execution, or analysis of tests. Context, as its name suggests, specifies the environment where the intervention takes place, in terms of people and their knowledge, the system under test, and the required models and other types of information.
In the next section, we report on the methodology of this study; namely, in Section 4.1, we discuss the seed papers that formed the basis of our search; in Section 4.2, we report on the final inclusion and exclusion criteria; and in Section 4.3, we report on the adapted taxonomy. In Section 4.4, we report on the search query and its validation with respect to the seed papers; finally, in Section 4.5, we detail our strategy to extract data from the set of included papers.

4 Methodology

In this section, we present the methodology used throughout our study that encompasses three phases. In the first phase, a pilot study was conducted in which we gathered a set of seed papers (Section 4.1), developed a set of inclusion/exclusion criteria (Section 4.2), and refined our taxonomy (Section 4.3). In the second phase (Section 4.4.1), we performed the search, applied the exclusion criteria, and coded the selected papers. In the final phase (Section 4.4.2), the search query was validated and refined via an analysis of the secondary studies on the subject and, also, in consultation with domain experts; a new search was performed and additional studies were included for review and coding. Finally, in Section 4.5, we present our strategy for further filtering papers by their content and an overview of the outcomes.
A repository containing artefacts of this study (namely, the seed papers, the result of the searches, and the coding) is publicly available.1

4.1 Seed Papers

The set of seed papers contains 26 manually selected studies gathered in consultation with domain experts: three experts from academia with 32, 23, and 19 years of experience and one expert from industry with 26 years of experience in computer systems testing and verification domains. We reviewed this set as a pilot study with the following objectives:
(1)
gathering keywords for the initial search query,
(2)
sharpening the inclusion and exclusion criteria, and
(3)
evaluating and adapting the SERP-Test taxonomy.

4.2 Selection Strategy

To set the boundaries for the scope of our study, based on our research questions, we defined and used a set of inclusion/exclusion criteria as follows:

4.2.1 Inclusion Criteria.

The criteria considered for inclusion of studies are as follows:
The topic of the study is on Testing RAS (Robotics and Autonomous Systems),
The context must consider the cyber and physical aspects of a system (as opposed to only physical, mechanical, or control parts), and
Evidence for applicability is provided.
In the scope of our study, Testing is interpreted in a broad sense, which includes formal verification techniques, static and dynamic testing, validation and non-exhaustive techniques.

4.2.2 Exclusion Criteria.

The studies matching the following criteria are excluded:
Not available online,
Not in English,
Short papers,
Not peer-reviewed,
Patents,
Published before 2008 (in the second phase), published before 2014 (in the third phase),
Not addressing robotics and autonomous systems,
No research contributions to testing (including validation or verification),
Only testing units in isolation, not considering the robotic and autonomous system as a whole (if the contribution to testing units is not specific to the system considered in the paper and can have applications in the bigger context, then we included the study),
The study only considers the physical aspects of the system and not software components,
Concerning human-controlled systems, e.g., UAVs and robots that are remotely controlled by a human, and
Papers on the topic of simulation: a large number of studies among the search results do not offer new contributions to the process or technique of testing interventions; we exclude such papers unless they provide clear contributions in the context of testing, validation, and verification, have an available tool, or provide evidence of applicability in an industrial context.
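As a purely illustrative aid, the following Python sketch shows how exclusion criteria of this kind could be encoded and applied mechanically during screening. The record fields and thresholds are hypothetical simplifications; in our study, the screening described above was performed manually.

```python
from dataclasses import dataclass

@dataclass
class PaperRecord:
    # Hypothetical metadata for a single search result.
    title: str
    year: int
    language: str
    peer_reviewed: bool
    pages: int
    addresses_ras: bool           # addresses robotic and autonomous systems
    contributes_to_testing: bool  # research contribution to testing/V&V
    system_level: bool            # considers the system as a whole, not a unit

def is_excluded(p: PaperRecord, min_year: int = 2014, min_pages: int = 6) -> bool:
    """Return True if the paper matches any exclusion criterion.
    The year and page thresholds are illustrative assumptions."""
    return (
        p.language != "English"
        or not p.peer_reviewed
        or p.pages < min_pages            # short papers
        or p.year < min_year              # cut-off used in the third phase
        or not p.addresses_ras
        or not p.contributes_to_testing
        or not p.system_level             # only unit-level contributions
    )
```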

4.3 Taxonomy

To consistently classify the set of included studies and extract the information required for answering the research questions described in Section 1.3, we follow a modified version of the SERP-Test taxonomy (see Section 3.3). We started with the high-level facets proposed in the SERP-Test taxonomy and, through a number of iterations, defined and re-defined a number of categories based on the information obtained from coding the included studies. The data extracted for each facet has been used to answer the research questions and to identify strengths and gaps (provided to researchers and practitioners) as part of our analysis in Section 6. An overview of the final taxonomy, based on which the studies are classified, is depicted in Figure 1.
Fig. 1. Illustration of taxonomy.
In what follows, we provide a brief description of our taxonomy:
Context. For context, we consider two main categories, namely, system under test and the technique.
System Under Test. System under test describes the type of systems on which the testing technique is applied. In our study, we consider two main categories of RAS, namely, Robotics and Autonomous Systems. These two categories of systems are selected, as they dominate the case studies and a broad range of systems that are considered in the studies concerning testing RAS.
Technique. This is the second category that is considered under Context, which represents the testing technique that is improved or affected as a part of the contributions of the work to testing RAS.
Models. Different types of models can be used for describing the behaviour of a system under test. We consider this category to extract the information about the variety of models that are used in the work on testing RAS.
Tools and languages. This category consists of details on the tools and languages with which the subject systems are described.
Effect. We refine the Effect facet (see Section 3.3) further to four categories as follows:
Metrics. This category encompasses the metrics used as a way of evaluating test adequacy or correctness of the subject, based on performance (i.e., efficiency and effectiveness) or coverage measures.
Performance. This category describes the effect of an intervention on the performance of the testing technique or the subject system. The performance covers a variety of measures concerning safety, quality, and resources observed during testing.
Coverage. This category concerns the measures that indicate how comprehensive the testing technique is when performed in the context of RAS.
Process. This category describes the kinds of effects that impact the testing process.
Technique. This category concerns methods presented as new testing techniques or as improvements to existing ones for testing RAS.
Tooling. In this category, we extract information about the type of tools that have been used throughout each work. We further classify the tools according to their availability: (1) open source, i.e., tools for which the source artefacts are available; (2) publicly available, i.e., tools that are accessible for use but whose source code has not been provided; and (3) private, i.e., tools that have not been made available for download or purchase.
Scope. This facet in SERP-Test taxonomy is further refined to two main categories as follows:
Model testing. This category represents techniques that use a model of the system for testing. We define two sub-categories for such techniques:
Simulation. This category comprises different types of simulation techniques used for testing RAS.
Formal verification. This category describes formal verification techniques that use a model of the system to rigorously verify the behaviour.
System testing. This category describes techniques that are applied on actual implementation artifacts of systems.
Static testing. This category describes techniques that test the system without executing code.
Dynamic testing. This category describes techniques that check the functional behaviour by executing the implemented code for the system.
Evaluation. For this facet of the SERP-Test taxonomy (see Section 3.3), we define the case study category, which has three main subcategories.
Case Study. This category specifies the type of systems that has been used in evaluations of the selected papers. We categorise the case studies into three subcategories, namely, small scale, benchmark, and industrial.
Small. We consider examples that are developed solely for the purpose of evaluating the method in a specific study and are not applicable for evaluating other, similar interventions (due to lack of available details, lack of genericity, or insufficient scale/number of subject systems) as small scale.
Benchmark. We consider a case study as a benchmark if it represents a set of systems with a sufficient level of detail such that they are or can be used as a point of reference in evaluations performed in the context of testing autonomous systems.
Industrial. We categorise a case study as industrial if the subject system is of industrial scale and the evaluation has been performed in an industrial context.
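To summarise how these facets come together when coding a single study, the following Python sketch shows one possible shape of a coding record. The field names and the example values are illustrative choices of ours, not the literal columns of our coding sheet.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CodingRecord:
    # Context facet
    system_under_test: str                  # "Robotics" or "Autonomous Systems"
    technique: str                          # testing technique improved or affected
    models: List[str] = field(default_factory=list)
    tools_and_languages: List[str] = field(default_factory=list)
    # Effect facet
    metrics: List[str] = field(default_factory=list)   # performance/coverage measures
    process_effects: List[str] = field(default_factory=list)
    tooling: str = "private"                # "open source" | "publicly available" | "private"
    # Scope facet
    scope: str = "model testing"            # "model testing" (simulation, formal verification)
                                            # or "system testing" (static, dynamic)
    # Evaluation facet
    case_study: str = "small"               # "small" | "benchmark" | "industrial"

# A hypothetical coded study:
example = CodingRecord(
    system_under_test="Autonomous Systems",
    technique="model checking",
    models=["probabilistic timed automata"],
    tools_and_languages=["Uppaal"],
    metrics=["verification time", "state-space size"],
    scope="model testing",
    case_study="benchmark",
)
```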

4.4 Search Strategy

A total of four searches have been conducted. Following the initial search, three additional searches were conducted to account for our own internal validation and, also, external validation from domain experts. In addition to Google Scholar, two digital libraries, namely, ACM and IEEE, which broadly cover publications in computer science and engineering, have been selected as search venues.

4.4.1 Initial Query.

From the seed papers, an initial set of keywords was extracted to form a search query; additional terms with close meanings and relation to the initial keywords were used to broaden the search. Our query is a conjunction of two main sub-queries: one that comprises terms relevant to our application domain, robotic and autonomous systems, and the other contains the terms related to testing and verification. The initial query was as follows:
(“Robots” OR “Robotics” OR “Deep learning” OR “Machine Learning” OR “Artificial Intelligence” OR “Robot Simulator” OR “Autonomous Vehicle” OR “Autonomous Vehicles” OR “Autonomous Cars” OR “Image Classification Systems” OR “Neural Networks” OR “Unmanned Vehicles” OR “Unmanned Aerial Vehicles” OR “UAV” OR “Connected and Autonomous Vehicle” OR “CAV” OR “Automated Functions” OR “Drive Assist” OR “Multi-Agent Systems” OR “Autonomous Agents”)
AND
(“Testing” OR “Validation” OR “Verification” OR “Safety Case Analysis” OR “Runtime Monitoring” OR “Robustness” OR “Simulation” OR “Coverage” OR “Metaheuristics” OR “Search-Based” OR “Combinatorial” OR “SMT Solving” OR “SAT Solving” OR “Constraint Solving” OR “Model Checking”)
For this first search, we limited the scope of our search to papers published between 2008 and 2018. Its outcome was a set of 3,030 studies.
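For illustration, a conjunctive query of this shape can be assembled programmatically before being adapted to the syntax of each search engine. The sketch below (with both keyword lists abridged) is our own illustration, not the exact script used in the study.

```python
# Illustrative sketch: build the conjunctive search query from two keyword groups.
domain_terms = [
    "Robots", "Robotics", "Autonomous Vehicle", "Autonomous Vehicles",
    "Unmanned Aerial Vehicles", "UAV", "Multi-Agent Systems",  # ...abridged
]
testing_terms = [
    "Testing", "Validation", "Verification", "Runtime Monitoring",
    "Model Checking", "SMT Solving",  # ...abridged
]

def or_group(terms):
    """Join a keyword list into a quoted, OR-separated group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = or_group(domain_terms) + " AND " + or_group(testing_terms)
print(query)
```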

4.4.2 Validated Query.

During our validation process, we made use of the seed papers, secondary studies (by checking papers that were referenced among included papers but were not an outcome of the search, i.e., snowballing), and domain specialists. We approached 50 academics and 15 industry experts in the domains of testing and verification to validate the outcome of the above search. They provide expertise in several areas, including verification and validation (31 experts with a median of 18 years of experience), artificial intelligence (8 experts with a median of 12 years of experience), human factors (5 experts with a median of 11 years of experience), and robotics and control systems (9 experts with a median of 14 years of experience). Of that group, we received detailed comments from 8 experts (7 academics and 1 from industry), with an average of 18.25 and a median of 26 years of experience in the field. This resulted in three revisions of our search query.
In the first revision, we included additional keywords (“Robot,” “Robotic,” “Swarm,” “Swarms,” “UAVs,” “Automated Driving,” “ADAS,” “Verifying,” “Verifiably,” “Assurance,” and “Assuring”), removed keywords that did not result in coded papers (“Machine Learning,” “Deep Learning,” “Artificial Intelligence,” “Image Classification System,” “Neural Networks,” “Robustness,” “Coverage,” and “Combinatorial”), and swapped terms for more generic ones (“Autonomous Vehicles” and “Autonomous Cars” were swapped for “Autonomous”). Furthermore, we observed that from years 2008 to 2014, only a handful of papers were included; this led us to further focus the search to papers published between 2014 and 2018.
In the second revision, we added the terms “Driverless” and “Self-driving.” Finally, in the third revision, to increase the relevancy of our results, we also included papers from 2019. The consolidated search query is as follows:
(“Robots” OR “Robot” OR “Robotics” OR “Robotic” OR “Swarm” OR “Swarms” OR “Autonomous” OR “Unmanned” OR “UAV” OR “UAVs” OR “CAV” OR “Automated Functions” OR “Automated Driving” OR “Drive Assist” OR “Multi-Agent Systems” OR “Multi-Agent System” OR “Driverless” OR “Self-Driving” OR “ADAS”)
AND
(“Testing” OR “Validation” OR “Verification” OR “Verifying” OR “Verifiably” OR “Assurance” OR “Assuring” OR “Safety Case Analysis” OR “Runtime Monitoring” OR “Metaheuristics” OR “Simulation” OR “SMT Solving” OR “SAT Solving” OR “Constraint Solving” OR “Model Checking” OR “Search-Based”)
The validation process resulted in a total of 7,679 additional and unique papers (i.e., the duplicates from the first search were automatically excluded).

4.5 Overview of the Results

As discussed in Section 4.4, 3,030 papers were obtained as a result of the initial query. As a result of the validation process, we obtained a further 7,679 papers. This led to a total of 10,709 search results. Our data extraction methodology was as follows:
First, we went through the results and filtered papers based on their title; we obtained a total of 1,247 potentially relevant papers. Second, the remaining studies were reviewed by abstract and we applied the exclusion criteria (see Section 4.2), which led to a final set of 195 studies. Third, this final set was coded according to our taxonomy and reviewed in detail as a part of this survey. Figure 2 shows a summary of the number of published articles clustered by year of release. We notice a steady yearly increase of studies included in our review.
Fig. 2. Relevant and included papers by year.

5 Results

In this section, we present the results of coding the literature in our taxonomy. We structure our results in terms of the four research questions. Regarding RQ1, we present the results concerning the different property specification languages and modelling languages and frameworks used for testing RAS. Regarding RQ2, we review the metrics used to measure the effectiveness, efficiency, and adequacy of testing interventions as well as the quality of systems under test. Regarding RQ3, we code the tools used to implement different interventions as well as any tools implementing the interventions themselves. Regarding RQ4, we present the evidence provided for the applicability of the interventions in terms of the case studies and benchmarks used to evaluate them.

5.1 RQ1: Models

In this section, we review the type of models and formalisms that are used for describing the behaviour of robotics and autonomous systems and their properties in testing interventions. Tables 1 and 2 show an overview of results of coding for models used in the studies included in this survey. We classify models according to their semantics (i.e., formal or informal), the domain in which they are employed (i.e., agnostic or domain-specific), and type (i.e., qualitative or quantitative).
Table 1. Models for System Properties
Table 2. Models for System Behaviour or Structure
We consider a model to be quantitative if it can represent measurable quantities such as probabilities or real-valued entities. Otherwise, the model is considered qualitative. This classification applies regardless of whether the results of the evaluation or the testing technique applied on the model is qualitative or quantitative.

5.1.1 Modelling Properties.

Table 1 presents the models that have been used to represent properties and the studies that employ them. Among all studies included in our survey, fewer than one-third use a model or logic to describe the properties of the subject systems. For this set of studies, all models are classified as formal. Among those, we notice that over two-thirds employ logics to describe qualitative properties of systems [8, 9, 17, 19, 22, 23, 41, 64, 69, 71, 75, 77, 81, 103, 107, 110, 119, 120, 121, 122, 134, 136, 140, 158, 170, 175, 197, 203, 216, 220, 221, 222]. Linear temporal logic, first-order logic, and epistemic logic are examples of such logics used in this set of studies. The remaining studies employ logics that can describe quantitative properties, e.g., describing stochastic or temporal aspects of systems [7, 8, 16, 37, 63, 87, 94, 110, 135, 137, 165, 171, 201, 234, 235]. We note a lack of languages that cater for specific domains; all property languages found in our survey are domain-agnostic.
A review of the results presented in Table 1 shows there is a limited number of studies that consider analysis of properties of systems formulated using formal logics. Furthermore, quantitative properties are considerably less represented in the selected studies. Properties to verify stochastic, continuous, and temporal aspects of the systems should play an important role when testing complex and real-time systems, such as RAS. This gap emphasises the need for quantitative logics that are tailored for the domain.
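To make the qualitative/quantitative distinction concrete, the following two property specifications are illustrative examples of our own (not taken from a particular included study): a qualitative LTL requirement and a quantitative, probabilistic (PCTL-style) requirement of the kind supported by tools such as Prism.
\[
\mathbf{G}\,(\mathit{obstacleDetected} \rightarrow \mathbf{F}\,\mathit{stopped}) \qquad \text{(qualitative, LTL)}
\]
\[
\mathrm{P}_{\geq 0.99}\,[\,\mathbf{F}^{\leq 60}\ \mathit{goalReached}\,] \qquad \text{(quantitative, PCTL)}
\]
The first requires that whenever an obstacle is detected the robot eventually stops; the second requires that the goal is reached within 60 time units with probability at least 0.99.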

5.1.2 Modelling System Behaviour or Structure.

In Table 2, an overview of models used for describing the behaviour or structure of robotics and autonomous systems is provided. Close to half of all of the included studies in this survey employ system models in their testing strategy; mathematical and rigorously defined models, i.e., formal models, are used in most of such interventions.
For instance, Petri nets and a variety of their extensions [10, 19, 75, 191, 230], labelled transition systems and some of their extended versions [12, 37, 94, 137, 189], finite state machines and their extensions [93, 140], and Markov chains [18, 157, 171, 199, 234, 235] are examples of such models. One observation is that, among studies using informal descriptions of systems, models built for Gazebo and ROS are the most commonly used [13, 14, 42, 51, 102, 117, 127].
Some studies employ a combination of models throughout their testing intervention; in particular, for some higher-level models, lower-level models can be used to specify their semantics [47, 156].
Of the studies that consider a behavioural model of their subject systems, around one-third utilise qualitative models. Most of such models are employed in formal verification strategies, where correctness is evaluated via mathematical proofs or model checking. The remaining studies use models that describe different quantitative aspects of systems such as temporal and stochastic behaviour, e.g., using variations of Petri nets (e.g., stochastic and coloured) [10, 19, 75, 132, 190, 191, 230], probabilistic timed automata [12, 137], and Markov chains [18, 157, 171, 199, 234, 235]; system dynamics, using differential equations [6, 54, 70, 129, 139, 148, 165, 202], hybrid automata and their extensions [39, 40, 82, 225, 226, 227], functional mockup units [1] and various informal simulation models for dynamical systems [13, 14, 42, 51, 102, 117, 127, 172, 200, 232].
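As an illustration of the continuous models mentioned above (ours, not drawn from a specific included study), a simple kinematic unicycle model of a mobile robot or vehicle can be written as
\[
\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega,
\]
where \((x, y)\) is the planar position, \(\theta\) the heading, and \(v\), \(\omega\) the commanded linear and angular velocities; hybrid automata combine such continuous flows with discrete mode switches (e.g., between nominal driving and emergency braking).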
Compared to studies before 2019, we notice that there has been an increase in the use of stochastic models (from 4% to 14%). However, this number is still relatively small given the innate probabilistic aspects observed in RAS; hence, this might indicate the need for further stochastic models that are tailored for the domain. Furthermore, we observe a prevalence of qualitative models, despite the importance of quantitative aspects in the behaviour of RAS.

5.2 RQ2: Effect

In this section, we review two different types of measures. The first type comprises measures used for evaluating the efficiency, effectiveness, and coverage of the various testing interventions. The second type comprises measures of quality for the subject system used during testing; reusing the same terminology, we classify them under efficiency (i.e., concerning timing and resources) and effectiveness (i.e., concerning safety and quality) of the subject system.

5.2.1 Measures for Interventions.

Table 3 provides an overview of our coding of these measures, classified into efficiency (testing time or resources), effectiveness (testing quality), and coverage (testing adequacy). It is remarkable that only about one-third of the papers included in this survey used a measure of efficiency, effectiveness, or coverage to evaluate their results. This shows a significant gap in using well-defined measures to evaluate and compare various interventions.
Table 3. Classification of Measures Considered in Testing Interventions into Efficiency (Testing Time or Resources), Effectiveness (Testing Quality), and Coverage (Testing Adequacy)
Effectiveness:
Accuracy of the image recognition (failure rates) [199]
Hypervolume in fixed time (search-space coverage in time) [25]
Feature interaction failure [2]
Distance-based surprise adequacy [112]
Number and probability of faulty scenarios generated [155]
Reachability [7, 18]
Number of test cases [113, 208]
Number of failures [175]
Number of counter-examples [205]
Accuracy of the simulation [198]
Efficiency:
Precision [159]
Generational distance in time (distance to Pareto-optimal solutions in time) [25]
Testing (test generation and execution, model checking) time [72, 77, 91, 102, 110, 121, 136, 141, 142, 147, 165, 167, 168, 182, 189, 205]
Test case generation time [13]
Test execution and simulation time [16, 32, 67, 70, 87, 117, 160, 171, 192]
Reduced test case execution time [27]
Testing cost (€/km) [35]
State-space size [5, 10, 15, 72, 77, 110, 122, 135, 182, 189, 201, 221, 222, 234]
Search time [53]
Coverage:
Hypervolume [25]
Structural coverage metrics (state, code, function, transition, path coverage) [10, 13, 14, 58, 93, 132, 190, 191]
Feature interaction (e.g., pairwise and n-wise coverage) [2, 33]
Neuron coverage [211]
Surprise adequacy coverage [112]
Situation (graph) coverage [143]
Requirement coverage [197, 209]
Diversity [159]
It is also noteworthy that the interventions were measured against a vastly different range of measures. Apart from some very basic notions of efficiency (testing time or state-space size) [13, 27, 53, 102, 110, 121, 147, 165, 171, 189, 192] and coverage (such as state and transition coverage) [10, 13, 14, 58, 93, 132, 191], most other notions are only used for a single intervention. This emphasises the need for domain-specific and more sophisticated notions of efficiency, effectiveness, and coverage that can be used for benchmarking and comparing various interventions. Some exceptions that concern domain-specific measures are hypervolume (as a domain-specific measure of the searched space) and generational distance (as a measure of distance from optimal solutions) [25], cost of testing for autonomous vehicles in Euros per kilometre [35], feature interaction coverage [2, 33], situation coverage [143], and neuron coverage [211] and surprise adequacy coverage [112].
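For orientation, two of the coverage notions above admit simple formulations as ratios. The following are generic, textbook-style definitions given for illustration only (individual papers may use variants), where \(S\) is the set of model states, \(N\) the set of neurons, \(T\) the test suite, and \(t\) an activation threshold:
\[
\mathit{StateCov}(T) = \frac{|\{\, s \in S : s \text{ is visited by some test in } T \,\}|}{|S|},
\qquad
\mathit{NeuronCov}(T) = \frac{|\{\, n \in N : \exists\, x \in T .\ \mathit{act}(n,x) > t \,\}|}{|N|}.
\]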

5.2.2 Measures for Subject Systems.

In this section, we review the measures of quality for the system under test that are used in various interventions, presented in Table 4. Unlike the previous section, domain-specific measures are more prevalent here; two commonly used measures are spatial deviation from the intended trajectory (and variants thereof) [30, 37, 126, 127] and collisions and obstacle avoidance [19, 25, 63, 130, 137, 140, 163, 213]. The remaining measures are sparsely used across many different interventions.
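Both of these commonly used measures admit simple formulations; the following are illustrative, textbook-style definitions (individual studies use variants), where \(p(t)\) is the executed trajectory, \(p_{\mathrm{ref}}(t)\) the intended one, \(d(t)\) the distance to the obstacle ahead, and \(v_{\mathrm{rel}}(t)\) the closing speed:
\[
\mathit{dev}_{\max} = \max_{0 \le t \le T}\, \big\| p(t) - p_{\mathrm{ref}}(t) \big\|,
\qquad
\mathit{TTC}(t) = \frac{d(t)}{v_{\mathrm{rel}}(t)} \quad (\text{for } v_{\mathrm{rel}}(t) > 0).
\]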
Table 4. Classification of Measures of Quality for the Subject Systems Used in Testing Interventions
Effectiveness:
Probability of time to collision [128]
Performance and safety properties [70, 176, 220]
Safety for human operators [127]
Satisfied performance properties w.r.t. number of robots [12]
Number of failures [176]
Requirements satisfaction [37, 214]
Spatial deviation of intended behaviour [30, 32, 37, 50, 78, 126, 127]
Endurance distance and stairs traversal of robots [105]
Accuracy of the image recognition [199]
Collisions & obstacle avoidance [11, 16, 19, 25, 49, 54, 63, 130, 137, 140, 163, 213, 238]
Stability [213]
Search depth [71]
Throughput [12]
Schedulability [73]
Positive and supportive interactions towards humans [150, 158]
Anthropomorphism measure [188]
Number of hazards and risk reduction measures [216]
Probability of mission success and failure [15, 141, 142, 158, 168, 226, 227, 234, 235]
Formal assertions (deadlock freedom, liveness) [7, 226, 227]
Criticality (complexity of scenario and dynamics) [57]
Vehicle performance (acceleration, speed, position) [75, 87, 127, 208]
Regret (difference between rewards earned and achievable rewards) [153]
Severity of failure [209]
Probability of rare events [166]
Efficiency:
Number of collisions over time [163]
Time to collision [113]
Resource utilisation (e.g., CPU) [71, 172]
Network usage [172]
Fuel consumption [70, 137]
Constraint violation rate [131]
Device utilisation [12]
Response time [12]
Training time [32]
Latency [160]
Idle time [6]
Task completion time [83, 158]
Time for hazard identification and risk reduction [216]
Median miles to next disengagement [236]
Battery life [234]

5.3 RQ3: Tooling

We gather and describe the tools that have been employed and introduced in the included studies. We categorise tools as context and effect tools; a context tool is one that has been employed by the intervention but is not a byproduct of the respective work. Effect tools, in contrast, are the tools that have been developed by the academic community in our list of selected papers.

5.3.1 Context Tools.

As shown in Table 5, tools for simulation are among the most utilised; their usefulness comes from providing a less costly way of checking whether the design and process are satisfactory. The middleware ROS [177] combined with the 3D simulator Gazebo [115] forms the most popular toolchain for robotics simulation. Furthermore, Simulink [65], a graphical extension of MATLAB [151], is the most used tool for modelling and simulation of dynamic systems.
Table 5. Tools Used in the Context of Testing Interventions for RAS and Description of Tools
In the context of autonomous vehicles, traffic simulators such as SUMO [21] and SYNCHRO [104] have also been employed by included interventions, along with vehicle simulators such as CarMaker [44] and Autoware framework [109].
Moreover, tools for formal verification are also extensively used, with model checkers being the most prominent type. The probabilistic model checker Prism [123] supports modelling and analysis of systems with stochastic behaviour, modelled as Markov chains or probabilistic automata. As for qualitative models, UPPAAL [125] offers formal verification of timed automata models, which can, however, be extended with data types.

5.3.2 Effect Tools.

A total of 37 tools, publicly available or otherwise, have been introduced by the academic community as an effect of their interventions. Seven of them were not accessible at the time of writing this survey and were classified as private, including SSIM [215], a tool for simulating flight software employed in Mars Rover projects. The remaining tools, a total of 30, are available to the general public; 27 of those also have their source artefacts made public and have been classified as open-source. In Table 6, we also include the specific licence, if any, under which each tool is released. We note that the source code of some tools was made available without a licence being specified; in this case, certain repositories, such as GitHub, consider that default (US) copyright laws apply.
Table 6. Tools Introduced by Studies Included in This Survey for Testing RAS
References | Name | Description | Availability
[211] | DeepTest | Testing of DNN-driven autonomous cars | Open-source (GPLv3)
[165] | APEX | Formal verification of autonomous vehicle trajectory planning | Private
[75] | Translation tool | Translation tool from GenoM to Fiacre | Private
[232] | Roadview | Traffic scene simulator for autonomous vehicles | Private
[46, 47, 48, 156] | RoboTool | Formal verification and simulation of robots | Public (no licence found)
[139] | MAV3DSim | Simulation platform for UAV controllers | Public (no licence found)
[4] | Florida Poly AV Verification Framework (FLPolyVF) | Verification of the decision-making of autonomous vehicles | Open-source (MIT licence)
[117] | Simulator in Julia | Robot simulation | Public (no licence found)
[52] | Stonefish | Simulation tool for marine robots | Open-source (GPLv3)
[67] | GzUAV | Framework to run multiple-UAV simulations in Gazebo | Public (no licence found)
[54] | Move | Suite of tools to test autonomous vehicles | Open-source (GPLv3)
[215] | SSIM | Simulation of flight software | Private
[6] | IMPROV | Tool for self-verification of robots | Public (no licence found)
[16] | VerifCar | Framework for validation of decision policies of communicating autonomous vehicles | Public (no licence found)
[18] | MCpMC | Statistical model checking of pMC | Open-source
[160] | Asynchronous Multi-Body Framework | Simulation of multi-body systems | Public (no licence found)
[53] | RobTest | Tool for stress testing of single-arm robots | Private
[78] | AsFault | Test case generation for self-driving cars | Public (no licence found)
[233] | CyberEarth | Simulation of robots and cyber-physical systems | Public
[37] | Argos | Multi-physics robot simulator | Open-source (MIT licence)
[63] | Drona | Programming framework for robotic systems | Open-source
[102] | ROSRV | Runtime verification framework for ROS | Public (no licence found)
[80] | Hybrid Simulation | 3D simulation tool | Public (no licence found)
[91] | Spot | Prediction of traffic participants | Open-source (GPLv3)
[100] | FROST* | Modelling and simulation of dynamical systems | Open-source (BSD 3-Clause)
[135] | PSV-CA | Probabilistic swarms verifier | Open-source
[175] | RoVer | Model checker | Open-source (BSD 3-Clause)
[98] | Formal | Modelling and symbolic execution of CPS | Private
[148] | UUV | Gazebo extension for underwater scenarios | Open-source (Apache-2.0)
[184] | V-REP | Robot simulator | Open-source (commercial or GPLv3)
[212] | MARS | Simulation environment for marine swarm robotics | Open-source (BSD 3-Clause)
[77] | Cruton | Translation from robotics DSL into NuSMV | Open-source (GPLv3)
[159] | Range Adversarial Planning Tool (RAPT) | Test scenario generation | Public (no licence found)
[195] | Pegasus | Autonomous vehicle simulation | Private
[198] | AirSim | Drone simulation environment | Public (commercial licence)
[224] | Cosina | Simulation of real-time robotics systems | Public (no licence found)
[136] | MCMAS | Multi-agent systems model checker | Public (no licence found)
Analogously to context tools, we notice a focus on the development of tools for simulation and model checking. Tools for testing vehicles are well represented, including for road [4, 54, 78, 165, 232], aerial [67, 139], and maritime [52] environments. As for robots, RoboTool [48] and IMPROV [6] offer formal verification alternatives for testing robots, while ROSRV [102] provides a ROS extension for verification at runtime.

5.4 RQ4: Applicability

Table 7 provides an overview of the case studies conducted in the included papers. We classify them as small, benchmark, and industrial. Case studies designed specifically to evaluate a particular intervention, which lack sufficient details or generality to be employed for a general class of interventions, were classified as small. Case studies that are sufficiently general and contain enough detail to evaluate a range of interventions, provided that they are not used in an industrial context, are categorised as benchmark. Industrial case studies are those real-world (and hence, typically detailed and complex) cases conducted in an industrial setting.
Table 7.
Case studies | References
Small:
Pedestrian detection[81]
Humanised robots[230]
UAV[17, 18, 30, 46, 56, 60, 93, 133, 233]
Cleaner agent[163]
Self-driving vehicle[147]
Sensor system[35]
Software functions[217]
Family of surgical robots[149]
Path planning and decision-making[51]
Surveillance drone[63]
Lane-changing scenarios[80, 165]
Small robot[62, 64, 218]
Simple controllers[19]
Unmanned Surface Vehicles[137]
USAR robots[10]
Cooperative forklifts[132]
Agricultural robot[189]
Cruise control[110, 130]
Traffic environment[131]
Multi-agent manufacturing controller[228]
AR.Drone[42]
Cooperative UAVs[103]
(Industrial scale) transport robot[12]
Platoon[107]
Robot swarm[7, 47, 119]
iCub robot[171]
Collision avoidance scenarios[4, 49, 128]
Trained gate controller[122]
Autonomous vehicles scenarios[129, 155, 226, 227, 231]
Footbot[48]
Path following autonomous vehicle[11]
Autonomous parking[50]
Car following[50]
Single arm robot[53]
Ultimatum game[188]
LEGO EV3 robot[204]
Simple robot with LiDAR[205]
Border control system[9]
Military overwatch missions[141]
“AMiResot” robot platform[146]
Service robot[167]
Re-configurable autonomous trolley[190]
Surgical robot[39, 40]
Cruise control agent[72]
Paint spray robot[82]
Group of robots[86]
Communicating robots[158]
Search mission[168]
Industrial:
ADAS System[25]
Self-driving system[2]
Emergency response robot[105]
Test drive in a test track[70]
Mars Rover[215]
Automated braking system[1]
Lateral State Manager[197]
ADAS scenarios[87]
Automated Emergency Braking[113, 208]
NASA benchmark and user case studies[153]
RexROV and Desistek SAGA mini-ROV[148]
Cartesian impedance Control System in torque mode[214]
Care-O-bot[77]
Farming[185]
Quadrotor with Pixhawk controller[198]
Adaptive cruise control[237]
Autonomous CoPilot agent[31]
Benchmark:
Autonomous off-road robot RAVON[176]
Swarm of robots[37]
Two-wheel differential drive robots[127]
UAV/Land vehicle cooperation[33]
Smores[213]
Udacity[211]
BERT 2[13, 14]
MIT and NIRA datasets[179]
Traffic sign database[199]
Benchmark[172]
Carina I[58]
Kobuki robot[94, 192]
LEGO EV3 robot[140]
RMP400 Robot MANA[75]
Landshark[102]
Alice autonomous vehicle[225]
Parallel delta robot[36]
Jack ROV[126]
UAV[161]
Quadcopter controller[74]
Videos of pedestrians and vehicles[32, 169]
Traffic wave observations[54]
Leader and follower UAVs[67]
ROBNAV mobile robot[73]
Udacity, MNIST, and CIFAR-10 datasets[112]
ATLAS robot[117]
Human-robot interactions[6, 150]
DaVinci research kits[160]
Turtlebot 2[202]
ZalaZone Smart City Zone[206]
Flexible Manufacturing System (FMS)[216]
Drone with Pixhawk flight controller[219]
WAYMO public road testing dataset[236]
Unmanned underwater vehicle (UUV)[235]
Windfarm drone[234]
Traffic Scenarios[91]
ATLAS and DRC-HUBO robots[100]
NAO robot[175]
NASA’s Unmanned Ground Vehicle[98]
Hanse UAV[212]
iRobot vaccum cleaner[15]
KUKA LWR4+ and the Universal Robots UR5[34]
Underwater vehicle[159]
Chemical detector[182]
COUR-1 robot[220]
Care-O-bot[221, 222]
COMAN[224]
CoCar parking[238]
Table 7. Classification of Case Studies Considered in Testing Interventions as Small, Industrial, and Benchmark
Our observation identifies a significant gap in the industrial evaluation of interventions; only 19 interventions [1, 2, 25, 31, 70, 77, 87, 105, 113, 148, 153, 185, 195, 197, 198, 208, 214, 215, 237] have been evaluated in an industrial context. Understandably, the majority of case studies have been conducted entirely in academic settings. Of those, approximately half made use of small-scale models, which are often not representative of real systems. The other half employed their proposed interventions on large-scale subjects and datasets, including physical systems.

6 Suggestions and Recommendations to the Study Audience

In this section, we analyse the results of the previous sections to identify relative strengths and weaknesses regarding our research questions and for our two target audience groups: researchers and practitioners. We conclude this section by drawing recommendations from our analysis both for researchers and practitioners.

6.1 Analysis

6.1.1 Domain.

Table 8 provides a concise summary of the domains covered by the reviewed interventions. The bulk of the reviewed interventions do not pertain to any specific sub-domain of RAS. This indicates a clear gap for subdomain-specific research that considers the characteristics of each of these subdomains and takes them into account in the testing interventions. Most importantly, the subdomains of testing marine and submarine RAS as well as space RAS are under-explored (the only included interventions regarding marine and submarine robots [52, 148, 174, 212] and space robots [153, 215] are not represented in the table for the sake of brevity). We note that there is a recently funded European project, REMARO, that aims to fill this substantial gap.2
Table 8.
Table 8. Testing, Validation, and Verification Interventions for Specific Subdomains of RAS
Below, we analyse the results gathered in Table 8 on a row-by-row basis:
Qualitative
Despite the intrinsically quantitative nature of RAS, qualitative models also play an important role in the verification of such systems, particularly when quantitative details are abstracted away so that the models become amenable to rigorous and exhaustive formal verification techniques. In the case of vehicles, qualitative models abstract away from physical dynamics and instead focus on observable behaviour that can be modelled as discrete events. Overall, we noticed a gap in qualitative analysis focusing on aerial vehicles and mobile robots (excluding road vehicles); this is likely due to the challenge of modelling movement without using continuous dynamics.
Road
In the domain of autonomous cars, qualitative models have been used to reason about Human-Machine Interactions [162, 231] and high-level decision-making, particularly regarding ethical concerns [60] and safety [81, 199]. Most of these interventions propose the use of formal models in their methodology [58, 60, 71, 72, 147, 162, 197, 205, 229, 231]. For instance, Yun et al. [231] propose a strategy that formalises Human-Machine Interaction in the SysML language to help steer the testing process. In the same vein, Naujoks et al. [162] provide a DSL based on a taxonomy of use cases to cover the transitions and modes of Human-Machine Interaction interfaces used in the verification process. Sun et al. [205] employ a Satisfiability-Modulo-Convex encoding to build finite-state abstractions of the learning component and formally verify it. Dennis et al. [60] focus on formalising and verifying ethical concerns in BDI agents. More informal approaches include the use of 3D simulation with scenarios described in UML notation [208] and the use of graphical notations for safety assurance analysis [81, 199]. A mix of formal and informal models is employed by Heitmeyer et al. [98], who provide two new tools to be included in their toolset (FORMAL [97]). The first tool synthesises formal models from scenarios written in Event Sequence Charts, and the second tool incorporates a 3D simulation tool (eBotworks [92]) into the toolset.
Aerial
In the aerial domain, there are only a handful of interventions that use qualitative models for testing aerial vehicles, mostly regarding safety and security concerns. For instance, linear temporal logic is used more than once to formalise safety assurance cases [17, 41]. Hagerman et al. [93] make use of finite state machines to extract security test suites, and Bhattacharyya et al. [31] focus on formally verifying the boundaries beyond which the agents are designed to operate, by translating models from a cognitive architecture (Soar [124]) into UPPAAL.
Mobile
Only two studies have been found in this category. Andrews et al. [10] model autonomous systems and their environment using Petri nets to generate test cases and apply their technique to a case study in the human-robot interaction domain. Furthermore, in the context of software product lines, Mansoor et al. [149] conduct a case study on a family of surgical robots by employing formal analysis, feature modelling, and testing; they discuss the key challenges and lessons learned from the case study.
Generic
Most of the included interventions in this category concerned abstract representations of multi-agent autonomous systems and provided efficient algorithms for parametric (formal) verification or state-space reduction techniques [8, 22, 23, 24, 119, 120, 121, 122, 134, 136, 167, 187]. Similar to the previous item, most of the interventions used Linear Temporal Logic or variants thereof to model safety properties [88, 222]. Formal modelling and verification of human-machine interaction is also a common theme in this category [69, 221, 222].
Quantitative
We see that quantities such as time (representing real-time behaviour), probabilities (representing abstractions of communication networks or choices made by human actors), and physical dynamics (such as velocity and acceleration) are used for testing in various domains. Most of the developed techniques, such as compositional verification techniques or intervention evaluation methods, do not pertain to domain-specific instances of these quantities and instead consider general multi-agent and robotic systems. Below, we review domain-specific interventions as well as interventions developed for the general context of RAS, structured by the columns in Table 8.
Road
In this domain, we see a strength in integrating code-level abstractions (e.g., for individual components or functions) with system-level specifications (for vehicles and fleets of vehicles); such integrations are then tested using simulation and formal verification frameworks. The quantitative models used for such testing interventions often pertain to vehicle dynamics and to probabilities arising from communication frameworks. AbdElSalam et al. [1] present a framework for verification of ADAS and autonomous vehicles that uses SystemC TLM models for virtual ECUs. Transaction-Level Models provide a high-level abstraction of the SystemC components that are used in virtual ECUs. These models are then integrated with the vehicle and traffic models for simulation. Parametric modelling of CAVs as a network of timed automata is used by Arcile et al. [16]. In this work, the VerifCar tool is applied to assess the impact of communication delays on the decision algorithms of CAVs and to check the robustness and efficiency of such algorithms. Similarly, variations and extensions of timed automata, probabilistic timed automata, and stochastic timed automata are used [12, 137, 226] for modelling the behaviour of autonomous vehicles to verify properties of different decision-making and collision avoidance algorithms. Barbot et al. [19] use statistical model checking to verify an autonomous vehicle controller specified in C++; a set of safety properties specified in HASL, a quantitative variant of linear temporal logic, are verified for the controller. Betts et al. [30] compare the effectiveness of two search-based testing methods, genetic algorithms and surrogate-based optimisation, for test case generation for UAV flight control software. There are several works in this category that provide and use simulation platforms [51, 70, 87, 128, 131, 133, 172, 206].
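To convey the flavour of the quantitative properties checked in such works, consider the following purely illustrative examples (the thresholds, time bounds, and predicate names are hypothetical and not taken from any of the cited studies): a probabilistic, time-bounded safety requirement and a bounded-response requirement, respectively:

    P_{\leq 0.01}\left[\, \mathbf{F}^{\leq 10\,\mathrm{s}}\ \mathit{collision} \,\right]
    \qquad\qquad
    \mathbf{G}\left(\mathit{obstacle\_detected} \rightarrow \mathbf{F}^{\leq 2\,\mathrm{s}}\ \mathit{braking}\right)

The first formula states that the probability of reaching a collision state within 10 seconds is at most 0.01; the second states that whenever an obstacle is detected, braking must start within 2 seconds.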
Aerial
The quantitative interventions reported in this domain are [18, 33, 42, 63, 103, 139, 161, 201, 234]. Timing information features prominently: for instance, human-machine interaction is formalised (through a formalisation of a cognitive architecture) in terms of networks of timed automata, and UPPAAL is subsequently used for verification.
Mobile
There are a few studies in this domain that consider quantitative models. Among these studies, the majority employ formal models [12, 37, 137]. Statistical model checking is used [12] to verify the performance of transport robots based on behavioural models (stochastic timed automata) using UPPAAL SMC. Furthermore, model checking of Markov models is used [37] to verify PCTL properties of swarm robotics behaviour in the design phase; the models are then used as a blueprint for implementation and simulation. Probabilistic model checking of unmanned surface vehicles is another technique used [137]: the PRISM model checker verifies PCTL properties of USVs on probabilistic timed automata as the behavioural model. Other work in this category [176] uses a purpose-built DSL (graph-based models) to describe the system behaviour, which is then used as a test model for generating test cases.
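To illustrate what probabilistic model checking of a simple reachability property (e.g., PCTL's P=? [F goal]) computes, the following minimal Python sketch iterates the standard fixed-point equations on a made-up three-state Markov chain; the chain, state names, and numbers are invented for illustration and do not come from any cited study.

    # Minimal sketch: probability of eventually reaching a "goal" state in a
    # discrete-time Markov chain, computed by fixed-point iteration.
    # The chain below is an invented example, not a model from any cited study.

    P = {  # transition probabilities: state -> {successor: probability}
        "search": {"search": 0.6, "found": 0.3, "lost": 0.1},
        "found":  {"found": 1.0},   # absorbing goal state
        "lost":   {"lost": 1.0},    # absorbing failure state
    }
    goal = {"found"}

    # x[s] approximates Pr(eventually reach goal | start in s)
    x = {s: (1.0 if s in goal else 0.0) for s in P}
    for _ in range(1000):  # iterate until (near) convergence
        x = {s: (1.0 if s in goal else sum(p * x[t] for t, p in P[s].items()))
             for s in P}

    print(round(x["search"], 4))  # -> 0.75 for this example

A dedicated model checker such as PRISM solves the same equations (exactly, via linear algebra) on much larger models and for richer PCTL operators; the sketch only shows the underlying idea.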
Generic
As a general observation, a considerable portion of all interventions in this category (23% of papers) have been out-of-the-box applications of model checking tools (mostly PRISM [123], in some cases UPPAAL [125] and FDR [85]) to small-scale robotic case studies; another prevalent category (30% of papers) concerns theoretical papers on various logics and model-checking algorithms for multi-agent robotic systems. Notable exceptions to this general theme include languages and toolsets for rigorous simulation [47, 48, 156, 182], the use of formal verification as part of the design of robotic interaction protocols [64, 228], the use of formal verification to analyse human-robot interaction [69, 158, 175, 216, 221, 222], and the generation of test cases from formal models [14, 46, 132, 163, 192]. Another interesting intervention concerned the comparison of different hybrid-systems solvers [39, 40] for formal verification of robotic applications. Furthermore, the theorem prover Isabelle/HOL has been used to formalise safety assurance claims [88], model checking has been used to train policies in reinforcement learning [171], and compositional techniques have been used to reduce the complexity of the model checking problem [170, 189]. Another noteworthy attempt is in formalising and verifying ethical concerns [60, 62].
Formal
In summary, there is a relative strength in the theoretical foundations of testing, validation, and verification, comprising various logical and specification formalisms and small-scale proof-of-concept exercises in model checking abstractions of RAS. There seems to be a recent trend, identified below, towards analysing human-machine interactions. There is a relative weakness in non-exhaustive testing, validation, and verification techniques, and in studying and improving their application to large-scale and industrial systems. To give a more nuanced picture, we break down our analysis further by domain:
Road
A number of techniques have been proposed in this domain for non-exhaustive testing from formal models [133, 208] and for scenario generation [157]. Model synthesis from scenarios has also been studied [98]. The application of different verification techniques, such as formal verification based on supervisory control, model checking, and deductive verification, has been studied in an industrial context [197]. Also in this domain, a framework for validating ethical policies has been developed [229], and human-machine interaction for user interfaces has been validated [231].
Aerial
Apart from applying traditional model checking [103, 201] and simulation [56] techniques to this specific domain, we observe notable attempts to combat the huge state-space of domain-specific models by employing statistical model checking [18] and runtime monitoring [63]. Most papers in this domain have focused on constructing safety models/properties or even coming up with safety specification frameworks [17, 41]; however, energy-efficiency [234] and to a limited extent, security [93], have been addressed as well.
Mobile
The landscape in this domain is much more sparse and scattered. As usual for this category, there are a number of applications of model-checking tools. There is a single paper on test description and test-case generation [176]; also, variability is an under-studied aspect in robotics that has been handled in this context [149] and, finally, there is an industrial case study on the application of model checking [12].
Generic
Formal verification is prevalent in this category; besides parametric verification of multi-agent systems using variations of epistemic logic [8, 22, 24, 121, 135, 187], formal verification using timed automata has also been used in a strategy that decomposes verification problems into smaller ones [189] and in the verification of path planning [7]. Furthermore, Araiza-Illan, Pipe, and Eder [14] use BDI models and model checking of probabilistic timed automata (in UPPAAL) to generate test sequences for human-robot collaboration tasks. Another use of UPPAAL in this category is for model checking of ROS applications, making use of an ad hoc translation from ROS to UPPAAL [94].
Verification of probabilistic aspects can be found in a few studies. Zhao et al. [235] employ Bayesian inference to estimate the distribution of the parameters of Markov chains; they then combine formal verification, synthesis, and runtime monitoring to check that the requirements are not violated under the estimated parameters. Pathak et al. [171] make use of probabilistic properties of Markov chains for self-repair capabilities in robots and tie those into a formal verification process (of PCTL formulae). Araujo, Mota, and Nogueira [15] apply probabilistic model checking to verify whether a robot trajectory (described in terms of an algorithm) satisfies specific behaviours or properties (stated as temporal formulae).
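To illustrate the general idea behind Bayesian estimation of Markov-chain parameters (a generic, textbook-style sketch with invented counts and thresholds, not the actual pipeline of Zhao et al. [235]), a single unknown transition probability can be estimated from observed transitions using a conjugate Beta prior and then compared against a requirement threshold:

    # Generic sketch: Bayesian estimation of one Markov-chain transition
    # probability with a Beta prior; the counts and threshold are invented.
    from scipy import stats

    successes, failures = 48, 2      # observed transitions into the "safe" successor
    prior_a, prior_b = 1, 1          # uniform Beta(1,1) prior
    posterior = stats.beta(prior_a + successes, prior_b + failures)

    # Probability (under the posterior) that the true transition probability
    # meets a hypothetical requirement of at least 0.9.
    requirement = 0.9
    confidence = 1 - posterior.cdf(requirement)
    print(f"posterior mean = {posterior.mean():.3f}, "
          f"P(p >= {requirement}) = {confidence:.3f}")

The estimated parameters (or their posterior bounds) can then be fed back into a probabilistic model checker, which is the kind of combination of estimation and verification that the studies above pursue.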
The combination of CSP and FDR can be seen in a few studies. Cavalcanti et al. [46] generate mutations of RoboChart models and feed the mutated CSP [101] (obtained from the RoboChart models) into FDR, which yields a counter-example that is used for testing. Yueng et al. [228] detail a process to support the design of simulation experiments by analysing variations of timing parameters in CSP; they show that the simulation experiment can yield different results both in performance and in behavioural terms. Sumida et al. [204] demonstrate a case study in which they model a LEGO robot (EV3) and verify freedom from deadlock and livelock in FDR.
Several other studies employ different strategies to achieve different goals. Gainer et al. [77] synthesise formal models from control rules of robots written in a DSL and input those models into NuSMV for model checking. Doan et al. [64] employ Maude to formally verify the gathering of robots in a ring network. Santos et al. [192] work on generating unit tests for ROS components and property-based testing for ROS using the Hypothesis tool [154]. Bresolin et al. [39, 40] apply reachability analysis of hybrid automata in ARIADNE [26] to analyse the dynamics of surgical robots.
Informal
In this category, there is a clear strength in generic simulation tools and architectures, followed by a strength in simulating road vehicles with domain-specific kinematic models. Some interventions focus on test-case generation and prioritisation as well as runtime monitoring, both for generic RAS and for road vehicles. There is a clear weakness in domain-specific interventions for aerial vehicles, for which only simulation tools for individual and connected vehicles are reported in the literature, and for mobile robots (excluding road vehicles), for which no intervention is included in our review. The simulation tools in various domains are often based on a combination of ROS [177] and Gazebo [115], Unity [210], and/or USARSim [45]. We refer to Section 5.3 for further explanation of these tools.
Road
A majority of papers in this category introduce a simulation tool [1, 51, 54, 70, 129, 155, 206, 232] combining vehicle kinematics with other aspects of vehicle modelling such as communications [51], vision-based algorithms [129], and fuel consumption [70]. Two interventions use search-based testing [25, 30]; surrogate modelling, where a higher-level model is used to steer the search, is used in both approaches. Another approach uses past data to identify challenging situations and embed them into test cases (using an XML structure) [200].
There are also a number of process interventions describing a process for safety assurance [81, 236] and testing Human-Machine Interfaces [162]. Some papers do use a well-defined syntax or a mathematical notation, but are classified as informal; in our classification, if a model does not have a rigorous formal syntax, semantics, and reasoning method, then it is classified as informal. These include using XML as a formal model [200], mathematical descriptions of vehicle kinematics (see simulation tools above), and probabilistic descriptions for risks [236].
Aerial
All interventions reported here concern simulation tools for modelling dynamics and control [42, 139, 161] and communication of aerial vehicles [67]. It is remarkable that these simulation tools rely on entirely different context tools, which will be analysed further in RQ3.
Generic
In this category, the majority of interventions again propose simulation tools [117, 148, 184, 202] or a simulation architecture [224]. (Note that two simulation tools address marine robots [148, 212], but since we did not have a separate class for such robots, we classified them here.) There are, however, a few interventions concerning test-case prioritisation [127], automated unit-test execution [34], and runtime monitoring [102]. The runtime monitoring environment [102] provides an integration with ROS implementations.
Effectiveness
Here, we summarise the notions of effectiveness used in different domains; these notions comprise two sub-categories: the effectiveness of the RAS under test, which is the oracle or the property against which the RAS is tested, and the effectiveness of the testing techniques, which provides an evaluation of the techniques, rather than the system under test:
Road
Concerning the effectiveness measures for road vehicles, collision analysis is the most prominent metric, including analysis on the number of collisions [11, 16, 49, 54, 113, 130, 227], probability of collision [19, 128], and severity [209]. Furthermore, a few studies focus on the analysis of deviation from the intended path in terms of spatial and rotational deviation [30, 32, 50, 78]. As for measures for test adequacy, probability of faulty and rare events [155, 166], and the number of tests generated [208] have been studied; however, these latter measures are not domain-specific and similar measures have been used in other sub-domains.
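As a concrete illustration of the path-deviation measures mentioned above, one common way to quantify spatial deviation is the maximum cross-track error between the driven trajectory and the intended (piecewise-linear) path. The following minimal Python sketch uses invented waypoints and positions; it is not the metric implementation of any cited study.

    # Minimal sketch of a spatial-deviation metric: maximum distance from each
    # driven position to the intended path (piecewise-linear). Data are invented.
    import numpy as np

    def point_to_segment(p, a, b):
        """Distance from point p to the line segment a-b."""
        ab, ap = b - a, p - a
        t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))

    def max_cross_track_error(driven, path):
        """Largest deviation of any driven point from the intended path."""
        return max(
            min(point_to_segment(p, path[i], path[i + 1]) for i in range(len(path) - 1))
            for p in driven
        )

    intended = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 5.0]])
    driven   = np.array([[0.0, 0.2], [5.0, 0.4], [10.0, 0.1], [15.0, 3.0]])
    print(f"max cross-track error = {max_cross_track_error(driven, intended):.2f} m")

Rotational deviation can be defined analogously over heading angles; the cited studies differ in which variant they report.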
Aerial
Only four studies have been included in this category. As measures of SUT effectiveness, Desai et al. [63] measure obstacle avoidance and plan execution. Similar to road vehicles, studies on the probability of completing a task for aerial vehicles are found in the literature [18, 234]. Zhao et al. [234] take a further step by analysing the expected mission time and the expected number of battery recharges during a mission. In contrast, as a test adequacy metric, Reference [198] measures the accuracy of its simulation.
Mobile
With respect to mobile robots, only a handful of papers collect metrics related to effectiveness. As in other sub-domains, collision avoidance [137] and the probability of satisfying requirements [37] can also be found here. Brambilla et al. [37] also measure the improvement in the behaviour of a robot by analysing the number of objects retrieved. Arai et al. [12] consider a similar measure of improvement in behaviour but in terms of device utilisation. The only metric regarding testing adequacy is the number of failures detected, studied by Proetzsch et al. [176].
Generic
As a unique measure of effectiveness, Ruijten [188] detects anthropomorphism in their subjects, which is a measure of human-likeness in robots. Several studies focus on analysing the probability of completing a mission successfully [15, 141, 158, 235]. Studies on safety [7, 163, 216] are also commonly found in this category. For instance, Vicentini et al. [216] focus their efforts on the analysis of hazards, such as the number of hazards identified, the number of types of hazard, and the number of risk reduction measures taken.
Efficiency
Similar to effectiveness, measures of efficiency can pertain to the system under test (measuring resource usage as a property to be checked or as an oracle for pass and fail) and the testing techniques (to measure the resources used in testing):
Road
Mullins et al. [159] use precision, convergence, and resolution as efficiency measures in testing. A number of studies on the verification of autonomous systems [71, 72, 110] measure the size of the state space, as well as the total memory footprint [71], to evaluate efficiency. Sun et al. [205], in the verification of finite-state models, use abstraction and verification time to estimate efficiency. Verification time is used in a number of studies [16, 71, 72, 91, 110, 147, 165] to measure efficiency. Gladisch et al. [87] use simulation time to measure efficiency, and similarly, Bi et al. [32] use simulation time to measure the efficiency of their work. Fayazi et al. [70] measure test duration in their evaluation. To evaluate efficiency, Reference [172] measures CPU usage and network bandwidth. Bode et al. [35] measure the cost (in euros) of applying their approach as a notion of efficiency. Li et al. [131] measure computational time in testing and comparing various autonomous vehicle decision and control systems.
Aerial
Sirigineedi et al. [201] use verification time and the number of states in Kripke structures (their system models) for measuring efficiency in their work. D’Urso et al. [67] use simulation time in Gazebo as a notion of efficiency. Zhao et al. [234], in their work using the PRISM model checker, consider the expected mission time as a notion of efficiency.
Mobile
Andrews et al. [10] consider the number of states and transitions as a measure of efficiency for the testing method. Arai et al. [12] present results on the response time, throughput, and device utilisation as measures of efficiency for the system under test.
Generic
Verification time is commonly used in a number of studies [77, 136, 141, 171, 182, 189] to measure efficiency. The number of states is another common notion measured in evaluations [15, 77, 122, 135, 182, 228]. Furthermore, testing and simulation time is used to measure efficiency in a set of studies [83, 117, 192]. Althoff et al. in Reference [6] measure the reduced idle time in self-verifying robots. Munawar et al. [160] measure latency and number of steps in simulation of surgical robots. Search time and optimisation time are used by Collet et al. [53] in evaluation of testing single arm robots. Gerstenberg et al. in Reference [83] measure task completion time in simulation. Vicentini et al. in Reference [216] present the total time for risk reduction and hazard identification in their evaluation of formal verification techniques. Probability of task completion, number of executed instructions, and time for completing the task are other measures used in evaluation of model checking autonomous systems [15]. Muhammad et al. [158] measure time and probability of task completion in probabilistic model checking of cooperative robot interaction.
Coverage
Compared to the measures of efficiency and effectiveness, coverage measures are not as widely adopted.
Road
Variations of structural coverage can be found in studies within this sub-domain. Neves et al. [58] developed a tool that conducts post-analysis (based on meta-models) on outputs collected from field testing of autonomous vehicles. Their tool aims to expand test coverage by exploring functionalities in the meta-model that were not covered during the field testing. Majzif et al. [143] devise a process that guarantees coverage of safety standards. They abstract results from component testing and make use of meta-models and situation graphs to compute a system-wide degree of test coverage and derive new scenarios to cover unexplored situations. Tatar [209] presents a method (implemented in TestWeaver [106]) for testing and validation of ADAS systems. The tool generates scenarios to cover relevant system states and feeds back previous executions to guide the next round of testing.
Aerial
In the only paper in this category, Bicevskis, Gaujen, and Kalnins [33] developed new methods for testing and validation of autonomous processes collaboration. They build a collaboration model using an extended finite state machine and employ symbolic execution and feasibility tree analysis to check that all relevant states can be reached in the model. They have evaluated their strategy using a UAV case study.
Generic
Tian et al. [211] propose DeepTest, a tool for testing deep neural networks in autonomous vehicles. It generates tests that explore different parts of the DNN logic with the goal of maximising neuron coverage. Araiza-Illan, Pipe, and Eder [14] propose a methodology for generating test cases that achieve high code coverage in human-robot collaborative tasks; they developed a testbench for ROS that makes use of belief-desire-intention (BDI) agents to generate valid and human-like tests. Structural coverage of Petri net models has been utilised by Saglietti in different contexts, such as the generation of test cases for autonomous agents [132] and the verification of reconfiguration behaviour of autonomous agents [190, 191].
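The notion of neuron coverage can be illustrated schematically as follows (a simplified sketch of the general idea with invented activation values, not DeepTest's implementation): a neuron counts as covered if its activation exceeds a threshold for at least one test input, and coverage is the fraction of covered neurons.

    # Schematic sketch of neuron coverage: the fraction of neurons whose
    # activation exceeds a threshold for at least one input in the test suite.
    # The activations and threshold below are illustrative placeholders.
    import numpy as np

    def neuron_coverage(activations_per_input, threshold=0.5):
        """activations_per_input: list of 1-D arrays, one per test input,
        each holding the (scaled) activation of every neuron in the network."""
        covered = None
        for acts in activations_per_input:
            hit = np.asarray(acts) > threshold
            covered = hit if covered is None else (covered | hit)
        return covered.mean() if covered is not None else 0.0

    # Invented activations for a 6-neuron network over three test inputs.
    suite = [np.array([0.9, 0.1, 0.0, 0.7, 0.2, 0.3]),
             np.array([0.2, 0.6, 0.1, 0.1, 0.1, 0.4]),
             np.array([0.1, 0.2, 0.1, 0.8, 0.9, 0.2])]
    print(neuron_coverage(suite))   # -> 0.666... (4 of 6 neurons covered)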
Open-source
Road
There are a handful of open-source tools for formal verification [16], testing [54], and simulation [4, 80, 211] of autonomous vehicles. Testing and simulation seem to be gaining some strength with respect to open-source tools, and we see tools for scenario generation, testing, and simulation of autonomous vehicles within various traffic scenarios. For connected vehicles, VerifCar [16] is a framework based on timed automata dedicated to modelling, verifying, and validating the policies of connected autonomous vehicles.
Garzón and Spalanzani [80] present a tool that combines 3D simulation (for ego-vehicle control) with a traffic simulator (which controls the behaviour of other vehicles). The goal is to test the ego-vehicle in realistic high-traffic situations. The FLPolyVF tool [4] connects functional verification, sensor verification, diagnostics, and industry/regulatory communication of autonomous vehicles while checking the effects of using different scenario abstraction levels. The MoVE tool [54] provides the possibility of modelling pedestrian behaviour. The framework focuses on testing autonomous system algorithms, vehicles, and their interactions with real and simulated vehicles and pedestrians.
Gambi, Mueller, and Fraser present the AsFault prototype tool [78]. The tool combines procedural content generation and search-based testing to automatically create challenging virtual scenarios for testing self-driving car software. Tian et al. [211] propose DeepTest, a tool for testing deep neural networks in autonomous vehicles. It generates tests that explore different parts of the DNN logic with the goal of maximising neuron coverage.
Aerial
All new tools reported here concern simulation tools for modelling dynamics and control [18, 139] and communication of aerial vehicles [67].
Lugo-Cárdenas, Lozano, and Flores [139] introduce a 3D simulation tool for UAVs whose focus is on assisting the development of flight controllers. Analogously, D’Urso, Santoro, and Santoro [67] also present a simulator for UAVs, called GzUAVChannel. The framework combines Gazebo, ArduPilot, and the NS-3 network simulator to provide a 3D visualisation engine, a physics simulator, a flight control stack, and a network simulator to handle communications among unmanned aerial vehicles. On the stochastic side of software verification, Bao et al. [18] present a prototype tool for parametric statistical model checking that can cope with complex parametric Markov chains on which state-of-the-art tools (such as PRISM) have timed out. They provide evidence of their tool’s efficiency by conducting an industrial case study.
Generic
Several open-source tools have been proposed in this category, with a majority of them being simulators. The only exceptions are a formal verification tool [175] for human-robot interactions and two runtime verification tools [63, 102].
Rohmer, Singh, and Freese introduce V-REP [184], a popular robotics physics simulator that is now known as CoppeliaSim. The tool uses a kinematics engine and several physics libraries to provide rigid-body simulations (including meshes, joints, and multiple types of sensors). Brambilla et al. have developed ARGOS [37], a multi-physics robot simulator that can simulate large-scale swarms and can be customised via plug-ins. In the Matlab environment, FROST [100] is an open-source toolkit for modelling, trajectory optimisation, and simulation of robots, with a particular focus on dynamic locomotion. Munawar and Fischer [160] present the Asynchronous Multi-Body Framework, which incorporates real-time dynamic simulation and interfaces with learning agents to train them and potentially allow for the execution of shared sub-tasks.
For underwater robots, three new tools have been introduced: Manhaes and Rauschenbach present the UUV simulator [148], which is an extension of Gazebo accommodating the domain-specific aspects of underwater vehicles; Cieslak et al. introduce Stonefish [52], a geometry-based simulator that can be integrated with ROS; and the MARS tool [212] provides simulation environments for marine swarm robots. As for tools in the human-robot interaction (HRI) domain, RoVer [175] provides visual authoring of HRI, formalisation of properties in temporal logic, and verification that the interactions abide by a set of social norms and task expectations.
Huang et al. present ROSRV [102], which is a runtime verification framework that can be used with ROS. Desai et al. [63] present a runtime verification framework based on Signal Temporal Logic [144], where an online monitor checks robustness on partial trajectories from low-level controllers (in the context of surgical robots).
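The flavour of such robustness monitoring can be illustrated with a minimal example (a generic sketch of quantitative STL semantics with invented data, not the monitor of Desai et al. [63] or ROSRV [102]): for a requirement of the form "always keep at least d_min clearance", the robustness of a sampled trace is the worst-case margin, which is positive when the property holds and negative when it is violated.

    # Generic sketch of quantitative (robustness) semantics for the STL formula
    # G (clearance >= d_min) over a sampled trajectory; the data are invented.
    def robustness_always_at_least(samples, d_min):
        """Worst-case margin of 'clearance >= d_min' over the trace:
        positive => satisfied with that margin, negative => violated."""
        return min(value - d_min for value in samples)

    clearance_trace = [1.8, 1.2, 0.9, 1.1, 1.5]   # metres, one value per sample
    print(robustness_always_at_least(clearance_trace, d_min=0.5))   # -> 0.4

An online monitor evaluates such margins incrementally on partial trajectories, which is what allows violations to be flagged before a mission completes.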
Public
The tools included in this category are diverse; we report below on a test-scenario generation tool for road vehicles [159], two simulation tools for aerial [198] and generic [233] robots, and a formal verification tool for generic robots (applied to a UAV case study) [46, 47, 48, 156].
Road
Mullins et al. [159] developed a tool (RAPT - Range Adversarial Planning Tool) for generating test scenarios. The tool employs an adaptive search method that generates new scenarios based on the performance and results of the previous one. A clustering algorithm ranks the scenarios based on the performance type and how close they are to the boundaries of each cluster. The boundaries are based on notions of efficiency, diversity, and scaling.
Aerial
Shah et al. [198] introduce the AirSim simulator, which generates training data for building machine learning models used in autonomous aircraft. It offers physical and visual simulation, including a physics engine and models of the vehicle, environment, and sensors.
Generic
There are two tools included in this category: a formal specification and verification tool [46, 47, 48, 156] and a simulation tool [233]. Cavalcanti et al. [46, 47, 48, 156] introduce RoboTool, supporting graphical modelling, validation, and model checking (via FDR [85]) of robotic models written in RoboChart [156] and RoboSim [48]. Zhang et al. [233] introduce CyberEarth, a framework for program-driven simulation, visualisation, and monitoring of robots. The tool integrates modules from several other open-source tools such as ROS [177] and OpenSceneGraph (OSG) [43].
Proprietary
Road
The three tools included in this category comprise two tools for formal analysis [98, 165] and a simulation tool [232], each of which is explained further below.
Heitmeyer and Leonard [98] introduce two tools integrated into the FORMAL framework; the tools synthesise and validate formal models. The first tool synthesises a formal Software Cost Reduction (SCR) requirements model from scenarios, and the second tool combines the existing SCR simulator [96] with the eBotworks 3D simulator to allow for the simulation of continuous components.
O’Kelly introduces APEX [165], a tool for formally verifying the trajectory planning and tracking stacks of ADAS in vehicles. Zhang et al. present RoadView [232], a photo-realistic simulator that tests the performance of autonomous vehicles and evaluates their self-driving tasks.
Generic
The three tools included in this category are diverse and range from simulation [215] to formal verification [75] to model-based testing [53].
Verma et al. [215] present a Flight Software simulator that is used to simulate MARS Rover missions. The simulator assists in predicting the behaviour of semi-autonomous systems by providing the capability for human operators to check if their intent is correctly captured by the robot prior to execution in different scenarios and environments. Foughali et al. [75] implement an automatic translation from GenoM [145], a robotics model-based software engineering framework, to the formal specification language Fiacre [28], which can be fed into TINA [29] for formal verification. Collet et al. [53] introduce RobTest, a tool for generating collision-free trajectories for stress testing of single-arm robots. It employs constraint programming techniques to solve continuous domain constraints in its trajectory generation process.
Small
A considerable number of studies, among those included in this survey, consider small case studies in their experiments. Here, we review the most prominent ones.
Road
A few papers employ case studies with a focus on collision avoidance for road vehicles [4, 49, 71, 128]; they all focus on detecting imminent collisions using built-in sensors. Similarly, Gauerhof, Munk, and Burton [81] conduct a case study where a machine learning function detects pedestrians using video analysis.
Several case studies concentrate on driving scenarios and manoeuvres, such as lane changing [19, 165], lane and path following [11, 232], merging [80], roundabouts [227], traffic scenarios [131, 155], parking [50], and overall cruise control [72, 110, 130, 147, 217]. A focus on the actual decision-making and path planning can be seen in a handful of the case studies [50, 51, 226] as well. Hardware-in-the-loop simulations [95] and human-machine interaction for driving assistance [231] can also be seen among the included case studies.
Aerial
The small-scale case studies in this domain only present theoretical or very limited models of UAVs, such as a surveillance drone [63] or a model for UAV launch [93]. Moreover, Brunel et al. [41] conduct a safety case analysis, while Bu et al. [42] explore simulation and realistic testing of vision-based object tracking for UAVs. Aerial drones in co-operative scenarios are also the subject of two case studies [103, 201].
Mobile
Only two small-scale case studies were included here: Lu et al. [137] make use of PRISM model checker to investigate three collision avoidance algorithms in an unmanned surface vehicle model with a dynamic intruder, and Arai and Schlingloff [12] employ a model checking technique on a transport robot model.
Generic
Several case studies can be found within this category, with the vast majority being small models that are applied to demonstrate the respective intervention. Here, we briefly present and discuss some of them.
Walter, Täubig, and Lüth [218] provide an algorithm that increases safety through formal verification using the theorem prover Isabelle; the case study is a small robot. Nguyen et al. [163] provide a multi-step process to verify the correctness of autonomous agents and apply it to a cleaner robot. Fu and Drabo [230] model a humanoid robot in an extension of Petri nets (called Predicate Transition Reconfigurable Nets, PrTR Nets) and formally verify it. Lill et al. [132] also make use of Petri nets; however, they develop models of cooperative forklifts and simulate scenarios where the robots decide which one has priority when passing through narrow pathways.
Farulla and Lamprecht [69] conduct a case study on human-robot interaction processes that have been modelled in DIME and show how they can be verified with the GEAR model checker. Zhang et al. [233] have built a virtual simulation platform, CyberEarth, for robotics and cyber-physical systems. A visual coverage task for UAVs is also introduced to demonstrate the platform. Dennis and Fisher [62] apply an agent verification approach to verify the correctness of an agent’s ethical decision-making. Doan, Bonnet, and Ogata [64] specify and formally verify, using the model checker Maude, a robotic gathering model.
Industrial
Road
The industrial case studies involving road vehicles included in our survey typically involve verifying specific components of such systems.
In the context of advanced driver assistance systems (ADAS), Abdessalem et al. [25] generate test cases for such a system that can visually detect pedestrians. Zhou et al. [237] introduce a framework for virtual testing of advanced driver assistance systems that uses real-world measurements. Kluck et al. [113] consider virtual driving scenarios for testing automated emergency braking. AbdElSalam et al. [1] use Hardware Emulation-in-the-loop to verify Electronic Control Units (ECUs) for ADAS systems.
Fayazi, Vahidi, and Luckow [70] implement a vehicle-in-the-loop verification environment and conduct field testing in the International Transportation Innovation Center (ITIC).
Gladisch et al. [87] select case studies that use industrial automated driving (adaptive cruise control, lane keeping, and steering control scenarios) to evaluate their search-based testing strategy. Abdessalem et al. [2] generate test cases for the SafeDrive system, which contains the following four self-driving features: autonomous cruise control, traffic sign recognition, pedestrian protection, and automated emergency braking.
Aerial
Shah et al. [198] build a model of a quadrotor with a Pixhawk controller in their newly developed simulator, AirSim, which includes a physics engine and supports real-time hardware-in-the-loop simulation. Rooker et al. [185] demonstrate their validation framework for autonomous systems in a farming context with simulations and field testing; they employ both UAVs and ground mobile systems. Bhattacharyya et al. [31] apply formal verification methods to an autonomous CoPilot agent.
Mobile
The only study in this category is the study conducted by Rooker et al. [185], which is also mentioned above. In summary, they demonstrate their validation framework for autonomous systems in a farming context with simulations and field testing.
Generic
Jacoff et al. [105] conduct field testing for the performance evaluation of robots used in disaster scenarios. Verma et al. [215] present a Flight Software simulator that is used to simulate MARS Rover missions. They demonstrate their approach with a case study. Satoh [194] conducts a case study using a physical transport robot to demonstrate their framework that can emulate the robot’s physical mobility.
Manhaes and Rauschenbach [148] model the Sperre SF 30k ROV underwater robot (RexROV) in the demonstration of the simulator for unmanned underwater vehicles. Uriagereka et al. [214] conduct simulation-assisted fault injection to assess safety and reliability of robotic systems. The feasibility of their method is demonstrated by applying it to the design of a real-time cartesian impedance control system. Gainer et al. [77] conduct a case study in the context of verification of human-robot interaction using the Care-O-Bot robotic assistant.
Benchmark
Road
Many different case studies have been included in this category; we briefly discuss the most distinguished ones. For instance, Neves et al. [58] developed a tool that conducts post-analysis (based on meta-models) on outputs collected from field testing of autonomous vehicles. Five field tests involving a program to control the navigation of an autonomous vehicle, CaRINA I, were performed. Zofka et al. [238] present the Sleepwalker framework for verifying and validating autonomous vehicles and demonstrate its benefits using different instances stimulating an autonomous vehicle.
Mullins et al. [159] have developed a tool (RAPT - Range Adversarial Planning Tool) for generating test scenarios to be employed on the System Under Test. Their tool is applied to realistic underwater missions. Heitmeyer et al. [98] synthesise software cost-reduction models of multiple autonomous systems to be used in a simulator integrated with the eBotworks simulation tool. Gruber and Althoff [91] present a reachability analysis tool (Spot) that finds counter-example to property violations. Their tool is evaluated using the CommonRoad benchmark PM1:MW1:DEU_Muc-3_1_T-1.
Pereira et al. [172] employ several small case studies in their attempt to couple two simulators, namely, SUMO and USARSim. Pasareanu, Gopinath, and Yu [170] present a compositional approach for the verification of autonomous systems and apply the technique to a neural network implementation of a controller for the ACAS Xu collision avoidance system for unmanned aircraft. Bi et al. [32] present a deep-learning-based framework for traffic simulation and execute several scenarios of intersections with and without pedestrians.
Aerial
Not many studies have applied their intervention to benchmarks of aerial systems. Bicevskis, Gaujens, and Kalnins [33] develop models for the testing of UAV and UGV collaboration in the Simulink environment. Mutter et al. [161] also explore the simulation of UAV models in Simulink and discuss the results when combining the platform and environment models. D’Urso, Santoro, and Santoro [67] simulate leader-follower UAV scenarios in their framework. Their goal is to combine four simulation environments: a 3D visualisation engine, a physics simulator, a flight control stack, and a network simulator.
Wang and Cheng [219] present a hardware-in-the-loop simulator for drones that can generate synthetic images from the scene as datasets, detect and verify objects with a trained neural network, and generate point cloud data for model validation. They simulate and conduct field testing on a physical UAV. Zhao et al. [234] model an unmanned aerial vehicle (UAV) inspection mission on a wind farm and, via probabilistic model checking in PRISM, show how the battery features may affect verification results.
Mobile
Two studies fit this category. Proetzsch et al. [176] use a purpose-built DSL (graph-based models) to describe the system behaviour of the autonomous off-road robot RAVON; the model is used as a test model for generating test cases. Brambilla et al. [37] model a probabilistic swarm that is checked in PRISM to evaluate their property-driven design method.
Generic
Several studies have applied their intervention to benchmark case studies of generic/immobile robots. We briefly discuss the most distinguished ones.
Tosun et al. [213] present a design framework that facilitates the rapid creation of configurations and behaviours for modular robots; they demonstrate their framework on the SMORES robot. Halder et al. [94] use the physical Kobuki robot as a case study, over which properties are automatically verified using the UPPAAL model checker; the focus of their approach is to model and verify ROS systems with real-time properties. Laval, Fabresse, and Bouraqadi [127] introduce a methodology to support the definition of repeatable, reusable, semi-automated tests and apply it to a two-wheel differential drive robot.
Bohlmann, Klinger, and Szczerbicka [36] automatically generate a model of a parallel delta robot on the fly; their method for model generation is based on machine learning and symbiotic simulation techniques. Mariager et al. [150] design and field-test a robot that interacts with adolescents with cerebral palsy. Althoff et al. [6] propose a framework (IMPROV) for self-programming and self-verification of robots, which is demonstrated on a physical robotic arm. Wigand et al. [224] have developed CoSiMA, an architecture for the simulation, execution, and analysis of robotic systems; they conduct experiments on the humanoid robot COMAN.
In Table 8, we map the identified subdomains to the different aspects of our research questions as follows:
RQ1
Across all subdomains, a majority of models have been formal and quantitative, and substantial gaps can be detected (most notably in the aerial vehicles and mobile robots subdomains) regarding the use of qualitative and informal models for testing.
RQ2
Across all studied subdomains, there is a clear gap in using precise notions of effectiveness, efficiency, and coverage. Among these, some generic notions of effectiveness and efficiency (such as testing time and state-space size) and of coverage (such as node and transition coverage) are the most-used measures for quantifying the effect. Common, more sophisticated measures of effectiveness, efficiency, and adequacy, such as the Average Percentage of Faults Detected (APFD) [186], do not seem to have been adopted in or extended to the domain of RAS. We do see a recent trend towards domain-specific notions of effectiveness and coverage [2, 25, 33, 35, 112, 143, 211]; almost all of these notions have been applied to the autonomous vehicles domain, but most of them can be adapted to other domains as well.
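For reference, APFD is commonly defined for a prioritised suite of n test cases detecting m faults, where TF_i denotes the position of the first test case that reveals fault i:

    \mathrm{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n\,m} + \frac{1}{2n}

Higher values indicate that faults are revealed earlier in the test order; adapting such a measure to RAS would require, for example, a domain-specific notion of what constitutes a revealed fault in a simulated or physical mission.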
RQ3
There is a considerable gap concerning tool support for testing RAS. There are very few open-source tools, mostly in the autonomous vehicles [4, 16, 54, 78, 80, 91, 211] and aerial vehicles [18, 67, 139] subdomains. No open-source tools support the domain-specific aspects of mobile robotic systems. The same pattern, with a more severe gap, is present for proprietary tools. Very few public (but not open-source) tools are developed or used in the reviewed literature.
RQ4
There is also a very severe gap across all subdomains in using industrial case studies for evaluating RAS testing interventions. The most notable exceptions are a handful of case studies, mostly in the autonomous vehicles [1, 2, 25, 70, 87, 113, 195, 197, 208, 237] and the aerial vehicles [77, 105, 148, 153, 194, 214, 215] sub-domains, performed in an industrial context. Many interventions used small case studies, mostly without any specific application subdomain (e.g., using generic models of mobile robots); in these cases, the models did not contain enough details to be part of a general benchmark. There have also been some evaluations performed on small case studies based on drones and UAVs.
Analysis for Researchers.
Gaps: In our analysis of the studied subdomains, there is a clear gap in treating marine and submarine RAS. Also, there is a relative weakness in treating aerial vehicles and mobile robots. Moreover, there is a relative weakness across subdomains concerning the treatment of informal and qualitative models. Developing a common set of notions of effectiveness and efficiency to compare different interventions is a worthwhile research challenge, and there is a gap in the literature in tailoring them to specific domains; the same observation holds for the notion of test adequacy. Tooling, particularly tooling tailored to specific subdomains, is a general weakness across interventions. Moreover, applying the interventions in an industrial context is an outstanding challenge.
Strengths: The road vehicle subdomain has considerable strengths across all research questions. Also, far more interventions have been developed for generic RAS without treating the specific concerns of sub-domains. Formal and quantitative models are by far the strongest interventions, both in terms of the number of techniques studied and the evaluations performed, even in industrial domains.
Analysis for Practitioners.
Gaps: Since most of the proposed interventions have not been evaluated in industrial context, evaluating their applicability, including studying factors such as the learning curve and training, remains a substantial gap.
Strengths: Due to the available strength in formal and quantitative models, developing such models provides a starting point to benefit from the developed and studied interventions. There is certainly more maturity in the area of road vehicles to benefit from in practice, but we can envisage that, by tuning the domain-specific aspects, other sub-domains may also benefit from these strengths.

6.1.2 Cooperation and Connectivity.

Verification methods are pivotal for the widespread deployment and public acceptance of autonomous systems. The need for such methods is intensified in the functions enabled by network services, due to the close interaction among the communication protocols, control software (e.g., for cooperation rules), and system dynamics. Existing (manual) analysis techniques typically do not scale to the huge design-space and input-space of these functions and, hence, in this work, we survey automated verification techniques found in the literature.
Table 9 provides an overview of the interventions used to test cooperation and connectivity in RAS. The interventions can be broadly categorised into swarm RAS, where an emerging behaviour is to be observed through cooperation of a large number of RAS, versus cooperative RAS, where few RAS units engage in a well-defined interaction (possibly with their environment) to achieve a goal.
Table 9.
Table 9. Testing Cooperation and Connectivity in RAS
In general, this turns out to be an understudied area of testing RAS, and little focus has been put on testing cooperative and connected scenarios in the literature. For the very few interventions reported in the literature, there is scarcely any evidence of efficiency or effectiveness available. The handful of reported evaluations are only performed on small-scale case studies and are not accompanied by open-source tools. In our analysis, we focused on cooperation among robots; however, only in 2019 did we encounter some papers that study cooperation from a human-robot interaction viewpoint [6, 150, 188].
Overall, there is very little work on the stochastic details of communication protocols. The studies in this category mostly focus on the verification of robot movement (i.e., gathering and merging).
Qualitative
Swarm
With respect to swarms, a number of theoretical studies [119, 120, 121] focus on scaling up the parametrised model checking problem to large swarm sizes. They employ various types of epistemic extensions of CTL as property specification languages. Their models include case studies on clustering of swarms, which synchronise to gather in a certain area. Cybulski et al.’s contribution [56] to the field is a simulation framework for the behaviour of UAV swarms. The framework also allows for performing simulations with a user-defined map of the environment.
Cooperative
Regarding the use of qualitative models of cooperative systems, our search only resulted in three studies, two of which employ variations of Petri nets as their models. Lill et al. [132] make use of Petri nets to develop models of cooperative forklifts; the forklifts communicate to decide which has priority when passing through narrow pathways. Saglietti et al. [191] employ Coloured Petri nets and classify cooperation into three distinct levels: perception-based, reasoning-based, and action-based cooperation. To demonstrate their strategy, they model platooning-like scenarios where different robots follow each other. The third study, by Doan et al. [64], is a more theoretical work on parametric model checking applied to multiple small robots gathering in circular configurations using a ring-based topology; it focuses on model checking the underlying distributed system against properties written in LTL.
Quantitative
Swarm
Three out of four papers in this category deal with probabilistic behaviour: Lomuscio et al. [135] perform model checking of probabilistic LTL, while Amin et al. [7] and Brambilla et al. [37] both describe properties in PCTL. Cavalcanti et al. [47], however, model timed dynamics in CSP.
Cooperative
The studies that have been classified in this category employ variations of temporal logic as their properties, such as LTL [103, 107, 158] and CTL [16]. As an exception to that list, Bicevskis et al. [33] provide a simulation environment in Simulink.
Formal
Swarm
Kouvaros et al. [119, 120, 121] provide several theoretical studies in the field of formal verification and model checking of autonomous systems; they have demonstrated the applicability of their strategy on a case study of the gathering of UAV swarms. With respect to probabilistic systems, three studies have been found: Lomuscio et al. [135] offer a strategy for parameterised model checking of probabilistic LTL properties. Furthermore, Brambilla et al. [37] provide a property-driven design method for probabilistic swarms that is checked using Prism. Last, Amin et al. [7] verify probabilistic behaviour expressed via PCTL properties using UPPAAL; they check deadlock freedom, safety and liveness properties, and reachability.
Cooperative
The majority of studies employing formal methods in analysing RAS use model checking [16, 33, 64, 103, 107, 158]; after model checking, the most frequently used technique is model-based testing [132, 190, 191]. Arcile et al. [16] and Kamali et al. [107] investigate car platooning manoeuvres. In the former, the vehicles are modelled as timed automata and UPPAAL is used as the model-checking tool. In the latter, joining and exiting operations are modelled as Belief-Desire-Intention models and model-checked using AJPF; the focus is on abstracting a formal (untimed) model from an agent (timed) model and checking the correspondence between the agent model and the code. With respect to probabilistic model checking, Muhammad et al. [158] model robots that synchronise to position themselves in an attempt to guarantee coverage of a certain area; the models are Markov Decision Processes that are checked using PRISM. Humphrey et al. [103] make use of the model checker Spin to investigate cooperation between UAVs and sensors, as well as collaboration among the sensors themselves.
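As an illustration (our own example formula, not taken from References [16] or [107]), a typical requirement for a platoon-joining manoeuvre could be phrased in LTL as \( \mathrm{G}\,(\mathit{request\_join} \rightarrow \mathrm{F}\,\mathit{joined}) \wedge \mathrm{G}\,\neg\mathit{collision} \), stating that every join request is eventually granted and that no collision ever occurs along any execution.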
Informal
Cooperative
The only study in this category [67] provides a simulation environment that integrates existing solutions for simulation of multi-UAV applications, such as Gazebo (for robotic simulation), ArduPilot (for UAV control algorithms), and NS-3 (for network simulation). Their case study is a model of a leader-follower application for large convoys of UAVs.
Effectiveness
As noted before, there are very few measures of effectiveness used for evaluating the system or the testing technique:
Swarm
Amin et al. [7] use very generic notions of effectiveness for the system under test, namely, deadlock freedom, safety, liveness, and reachability. Brambilla et al. [37] go beyond that: in addition to domain-agnostic properties, such as the probability of satisfying the (safety) requirements, they measure the aggregation time of the swarm and the improvement of the behaviour (in terms of objects retrieved).
Cooperative
Muhammad et al. [158] measure the probability of task completion and of human interaction as measures of effectiveness in wireless sensor networks. Arcile et al. [16] measure the number of collisions in their vehicle verification approach.
Efficiency
Swarm
Lomuscio et al. [135] count the number of states and transitions as a measure of efficiency of their formal verification methodology.
Cooperative
Muhammad et al. [158] also measure the time of task completion and of human interaction as efficiency measures for the system under test. Arcile et al. [16] measure the travel time for the system under test and the verification time as efficiency metrics for the respective testing technique. D’Urso et al. [67] measure simulation time as a measure of efficiency of their integrated simulator.
Coverage
Cooperative
Saglietti, Winzinger, and Lill [191] consider state coverage in the analysis of their model-based reconfiguration testing strategy, while Bicevskis et al. [33] test collaborative UAVs and UGVs and consider “the complete test set” as a measure of test coverage. Lill and Saglietti [132] model Petri net entities and address the maximisation of interaction coverage while minimising the number of test cases.
Open-source
Swarm
With respect to swarms, two open-source tools have been reported: the PSV-CA tool [135] can model-check probabilistic LTL properties for swarm systems, and ARGOS [37] is a multi-physics robot simulator that can simulate large-scale swarms and can be customised via plug-ins.
Cooperative
Only two open-source tools have been found for cooperative robots: VerifCar [16] allows for fault injection in models for UPPAAL model checking. GzUAV [67], meanwhile, is a simulation tool for connected UAVs.
Public
Swarm
The only public, non-open-source tool that has been employed in the testing of swarms is RoboTool by Cavalcanti et al. [48]; RoboTool supports modelling and model checking (through the FDR tool [85]). The tool has been applied to a UAV swarm case study.
Small
Swarm
Kouvaros and Lomuscio [119, 120] study parameterised verification of robot swarms against temporal-epistemic specifications and model a small, theoretical robot swarm. Cavalcanti et al. [47] introduce RoboChart, which allows for modelling and verification of interacting robots. Cybulski [56] provides mathematical models of a UAV swarm that can be simulated in their proposed framework. Amin et al. [7] present a formal verification approach using timed automata for verifying the path planning of robot swarms.
Cooperative
Poncela and Aguayo-Torres [174] conduct a case study where they test underwater robots’ wireless communication. Lill et al. [132] make use of Petri nets to develop models of cooperative forklifts and simulate scenarios where the robots decide which one has the priority when passing through narrow pathways. Humphrey et al. [103] make use of the model checker Spin to investigate cooperation between UAVs and sensors, as well as collaboration among the sensors themselves. Saglietti, Winzinger, and Lill [191] use coloured Petri nets to model interacting autonomous agents and generate test cases for reconfiguration scenarios.
Benchmarks
Swarm
The only reported benchmark for this category is by Brambilla et al. [37], where they investigate aggregation and foraging manoeuvres on large-scale swarms of multiple sizes.
Cooperative
D’Urso et al. [67] evaluate their methodology on a number of test programs using different UAV fleet sizes. They aim their evaluation at testing (i) the scalability of the solution and (ii) its performance, by comparing the simulation time with the physical execution time.
Industrial
Cooperative
The only reported industrial case study is by Rooker et al. [185]. They make use of a simulation tool in the smart farming domain. Land and air robots are modelled using real dynamics and cooperate to complete farming tasks.
RQ1
Regarding the models used for analysing cooperative scenarios in RAS, we notice that formal probabilistic models (based on variations of temporal logic [119, 120, 121], process algebra [48], and timed automata [7, 16]) are the most-used types of models. Often these models are used for model checking abstract models of cooperative scenarios. Qualitative and informal models are used far less often in this context; informal models appear only as input to simulation tools [67].
RQ2
Most notions of effectiveness and efficiency are generic ones, such as state-space size, verification time, and test coverage [33, 132] for the technique, and deadlock freedom and the probability of satisfying a temporal logic formula for the system under test. The only exceptions where domain-specific notions of efficiency and effectiveness were used concern the aggregation time of the swarm [37] and the effectiveness of human-robot interactions [158].
RQ3
There is clearly a lack of tools for testing cooperative and swarm scenarios in RAS. The only exceptions are public model checking tools [16, 37, 48, 135] and a simulation tool for connected UAVs [67].
RQ4
Very few studies have evaluated their interventions on industrial-scale case studies [185] and benchmarks [33, 37, 67].
Analysis for Researchers.
Gaps: An analysis of the included studies reveals that in cooperative scenarios for RAS, the role of communication networks and protocols and their effect on functionality, safety, and reliability of the RAS system is severely understudied. Integrating the body of knowledge available in communications with the testing and verification of RAS is clearly an area for future research. The very few available studies do not provide domain-specific measures of efficiency and effectiveness that pertain to the cooperative aspects and the emerging cooperative behaviour. Moreover, there is a lack of sufficient evidence of strategies being applied to industrial-scale case studies and benchmarks.
Strengths: There is certainly a strength in abstract theories for parameterised model checking of swarms. Apart from that, there is no other concentrated area of strength.
Analysis for Practitioners.
Gaps: As noted above, we do not think we have reached sufficient maturity in the research results for cooperative and swarm robots to be able to apply them in practice. Even the existing techniques have not been applied to many industrial case studies yet, and no stable tool-sets are available at the moment. Working with researchers to define meaningful notions of efficiency and effectiveness as well as providing benchmarks and industrial case studies could lead to an impactful future research agenda.
Strengths: There are no practical areas of strength in testing cooperative and swarm RAS scenarios.

6.1.3 Testing Strategy.

Table 10 provides an overview of the testing strategies used for RAS. By far the most widely used strategy is formal verification, followed by simulation and then runtime monitoring. Model-based testing is the least-researched strategy.
Table 10. Overview of the Testing Strategies Used for RAS
Qualitative
Simulation
Heitmeyer et al. [98] synthesise state-based formal models (Software Cost Reduction tabular models) from scenarios specified in Mode diagrams (extensions of Message Sequence Charts). The models are used in a simulator integrated with the eBotworks simulation tool. Cybulski [56] developed a simulation tool for UAVs based on class and activity diagrams; further, the framework allows for user-defined maps of the environment.
MBT
Two search-based testing approaches are employed in this category: Lill and Saglietti [132] employ a genetic algorithm to maximise coverage of their Petri net models when generating test cases. Analogously, Nguyen et al. [163] provide a multi-step process to verify the correctness of autonomous agents; they make use of multi-objective evolutionary algorithms to cover stakeholder soft goals. Araiza et al. [14] and Andrews et al. [10] focus on human-robot interaction: the former generate test cases from BDI models, while the latter focus on coverage to generate test cases from Petri nets. Another model-based testing contribution that uses Petri nets is Reference [191], which uses Coloured Petri nets and structural coverage metrics to generate test cases for reconfiguration scenarios. Finally, Hagerman et al. [93] combine a behavioural model with attack and mitigation analyses to generate a security test suite for UAVs.
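To make the search-based flavour of these approaches concrete, the following minimal sketch (our own illustration under simplifying assumptions; names are hypothetical, and this is not the tooling of References [132] or [163]) evolves test cases, represented as sequences of model transitions, towards higher coverage:

import random

def coverage_fitness(test_case, already_covered):
    # Fitness: number of not-yet-covered transitions exercised by this test case.
    return len(set(test_case) - already_covered)

def evolve_tests(population, already_covered, generations=50, mutation_rate=0.1):
    # Assumes the population holds at least two non-empty test cases (lists of transition ids).
    for _ in range(generations):
        population.sort(key=lambda tc: coverage_fitness(tc, already_covered), reverse=True)
        parents = population[: max(2, len(population) // 2)]
        children = []
        while len(parents) + len(children) < len(population):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, max(1, min(len(a), len(b)) - 1))
            child = a[:cut] + b[cut:]  # one-point crossover
            if random.random() < mutation_rate:
                gene_pool = [t for tc in parents for t in tc]
                child[random.randrange(len(child))] = random.choice(gene_pool)
            children.append(child)
        population = parents + children
    return max(population, key=lambda tc: coverage_fitness(tc, already_covered))

The fitness function here rewards test cases that exercise transitions not yet covered by the existing suite, which is the essence of coverage-driven search-based test generation.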
Formal Verification
The vast majority of papers in this category perform formal verification based on properties specified in variations of temporal logic; since autonomous systems specifications typically involve aspects such as beliefs and intentions, several studies are dedicated to studying the theoretical boundaries (e.g., (un)decidability) of verifying epistemic extensions of temporal logics [119, 120, 121, 122, 134, 136]; a notable tool used in this context is the MCMAS model checker [136], which is also evaluated on a small-scale benchmark against the general-purpose model checker NuSMV.
Many theoretical studies address the issue of abstraction for parameterised specifications, where the parameters can be the number of autonomous agents [119, 120, 121, 122, 134, 136] or the size and shape of the arena [8, 9]. Aminof et al. [9] investigate the decidability problem for parameterised grid sizes; they find that restricting the grid size makes the problem solvable in PSPACE. In the same vein, Aminof et al. [8] establish a framework in which to model and automatically verify autonomous agents. The framework contains an algorithm tailored to solving a parameterised verification problem in which the model graphs are the parameter.
Coming up with temporal logic specifications is known to be difficult and requires some level of formal training. A few papers therefore focus on how LTL properties are formulated for concrete scenarios. Webster et al. [221, 222] model scenarios for a robot in the healthcare sector; they use Brahms as the language to describe human-robot interaction scenarios, and the properties are written in LTL. Babiceanu and Seker [17] combine LTL and Event-B to build models of trustworthiness for small unmanned aerial systems (sUAS).
The formalisation of, and application of formal verification to, different cognitive architectures has been the focus of many studies.
Bhattacharyya et al. [31] formalise a rule-based representation of a cognitive architecture built in the Soar framework [124] using UPPAAL and connect the verified agent to a simulation environment. They model an auto-pilot avionics system and analyse contingency situations during takeoff.
The Belief-Desire-Intention (BDI) framework is another natural cognitive architecture for specifying autonomous agents, and it has been used extensively in the literature. Several studies make use of the MCAPL framework [61]. Dennis et al. [60, 62] verify ethical aspects of autonomous agents’ interactions with people by modelling their behaviour using BDI models and capturing ethical priorities; these ethical models are subsequently model-checked against LTL specifications using the MCAPL framework. Furthermore, Ferrandes et al. [71] model autonomous vehicle components and also use MCAPL to formally verify their BDI models. Last, Ferrando et al. [72] go further and provide an approach that combines formal verification and runtime monitoring by specifying trace behaviour in Prolog and connecting it with a Java implementation (using the JPL framework) for runtime monitoring.
Sun et al. [205] study the effect of neural network components on the behaviour of autonomous systems, in particular networks with rectified linear (ReLU) activations, and analyse them using Satisfiability Modulo Convex optimisation (SMC). To mitigate the verification effort, they apply a pre-processing step and evaluate its effect on the verification time as a function of the number of neurons in the network.
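For background, a single ReLU unit \( y = \max(0, x) \) is commonly captured by a Boolean activation variable \( \delta \in \{0,1\} \) together with the linear constraints \( y \ge x \), \( y \ge 0 \), \( y \le x + M(1-\delta) \), and \( y \le M\delta \), for a sufficiently large bound \( M \) on \( |x| \); this is the standard mixed encoding used for illustration here, not necessarily the exact constraints of Reference [205]. Satisfiability Modulo Convex approaches then reason over the Boolean activation pattern and delegate the remaining convex constraints to a convex solver.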
With respect to verifying the safety of human-robot interaction, Vicentini et al. [216] provide safety assessments by formally verifying models written in the TRIO temporal logic [84]. Their strategy aims to identify hazardous situations associated with non-negligible risks. Analogously, Farulla and Lampretc [69] focus on model checking security properties formulated in Computation Tree Logic (CTL).
Selvaraj et al. [197] evaluate the application of different formal techniques to verify control software for an autonomous vehicle. They investigate the application of Supervisory Control Theory, Model Checking, and Deductive Verification and provide insights on how these different approaches can address different industrial challenges.
Quantitative
Simulation
Most interventions in this category focus strictly on introducing a simulation tool for a specific sub-domain, such as underwater robots [52, 148, 212], vehicles [54, 200, 232], robots [100, 126], and UAVs [139, 161].
A number of interventions, however, combine a simulation approach with other testing aspects. Li et al. [131] employ a game-theoretical approach where vehicles have different levels of knowledge about other vehicles: a level-0 car has no knowledge about the other cars, while a level-k car has information about level-(k-1) cars. Counter-intuitively, they show that, in some instances, lower-level cars cause fewer constraint violations.
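The level-k idea can be sketched as follows (a minimal, illustrative rendition of our own, with hypothetical helper functions, rather than the authors' implementation): a level-0 driver follows a fixed baseline policy, and a level-k driver best-responds under the assumption that all other vehicles reason at level k-1.

def level_k_action(k, state, actions, simulate, reward, baseline_policy):
    # Level 0 simply follows the baseline (e.g., keep lane and speed).
    if k == 0:
        return baseline_policy(state)

    def others_policy(other_state):
        # Every other vehicle is assumed to reason one level below.
        return level_k_action(k - 1, other_state, actions, simulate, reward, baseline_policy)

    # Best response: pick the action with the highest predicted reward.
    return max(actions, key=lambda a: reward(simulate(state, a, others_policy)))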
Szalay et al. [206] provide a scenario-in-the-loop simulation using SUMO and the Unity engine. They simulate simplified platooning and valet parking scenarios both in simulation and in a real smart-city environment (ZalaZone).
Verma et al. [215] present a Flight Software simulator that is used to simulate Mars Rover missions. The simulator assists in predicting the behaviour of semi-autonomous systems by allowing human operators to check whether their intent is correctly captured by the robot prior to execution in different scenarios and environments.
Two studies present supporting libraries: Koolen et al. [117] implement a robotic simulation library in the Julia programming language; the library offers support for robot dynamics, visualisation, and control algorithms. Rohmer et al. [184] developed libraries to integrate V-REP with other programming languages (Lua, C++, Java, Python, Matlab, and Ruby), with support for different types of 3D objects and modules for kinematics and dynamics.
Model-based Testing
Multi-objective search is an increasingly popular technique for coping with complex robotic systems. Betts et al. [30] employ a Monte Carlo search heuristic to compare the lateral-distance outcomes of surrogate-based models against a known ground truth in UAV applications. A similar approach is used by Abdessalem et al. [25] on pedestrian detection using a vision-based system; they employ NSGA-II [59], using minimum distance and minimum time to collision as fitness functions, and compare the performance of the heuristic with and without surrogates.
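The core of such multi-objective set-ups is a Pareto-dominance comparison over the fitness vectors; the following simplified sketch (our own illustration with hypothetical field names, not the NSGA-II implementation used in Reference [25]) keeps the scenarios that no other scenario dominates:

def scenario_fitness(trace):
    # Both objectives are minimised: smaller values mean more critical scenarios.
    return (min(step.distance_to_pedestrian for step in trace),
            min(step.time_to_collision for step in trace))

def dominates(f, g):
    # f dominates g if it is at least as critical on both objectives and strictly more critical on one.
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))

def pareto_front(scenarios, run_simulation):
    scored = [(s, scenario_fitness(run_simulation(s))) for s in scenarios]
    return [s for s, f in scored if not any(dominates(g, f) for _, g in scored)]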
Saglietti et al. [190, 191], however, propose multiple notions of coverage to help generate test cases from Petri net models of autonomous, cooperative, and reconfigurable robots. Furthermore, they employ statistical testing techniques, which aim to evaluate the degree to which the observed behaviour is acceptable.
Lindvall et al. [133] employ a framework for automated testing that combines metamorphic testing principles with model-based testing: test cases are generated from test models, together with multiple scenario variations that are programmatically derived from metamorphic relations.
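A minimal sketch of the underlying metamorphic idea (our own illustration with hypothetical names, not the framework of Reference [133]): a follow-up scenario is derived from a source scenario by a transformation for which the expected relation between the two outputs is known, so no explicit oracle is needed.

def metamorphic_check(scenario, transform, run_system, relation):
    # Execute both the source and the programmatically derived follow-up scenario.
    source_output = run_system(scenario)
    follow_up_output = run_system(transform(scenario))
    # The test passes when the metamorphic relation holds between the two outputs.
    return relation(source_output, follow_up_output)

# Example relation: mirroring the scene left-to-right should not change
# whether the obstacle is detected.
def detection_invariant(source_output, follow_up_output):
    return source_output.obstacle_detected == follow_up_output.obstacle_detected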
Runtime Monitoring
In this category, runtime verification checks are typically coupled with model-checking strategies, with properties being checked during system execution. For instance, Desai et al. [63] present an STL-based framework in which an online monitor checks, on partial trajectories from the low-level controllers, that the assumptions made during model checking continue to hold at runtime.
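As a simplified illustration (our own sketch, not the implementation of Reference [63]), the quantitative robustness of an invariant such as "always keep at least d_min clearance" over the trajectory observed so far is the worst-case margin; a negative value indicates a violation, and a small positive value warns that a model-checking assumption is close to being broken.

def clearance_robustness(partial_trajectory, d_min):
    # Robustness of G(distance >= d_min): minimum margin over the observed samples.
    return min(sample.distance - d_min for sample in partial_trajectory)

def assumption_holds(partial_trajectory, d_min, margin=0.0):
    # The monitor raises a flag as soon as the robustness drops to or below the margin.
    return clearance_robustness(partial_trajectory, d_min) > margin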
In the context of obstacle avoidance, Luo et al. [140] employ JavaMOP to verify that a robot does not behave against requirements written in the FSM and PTLTL (past-time LTL) languages. Temporal properties are also employed by Wang et al. [220], where the RoboticSpec specification language for robotic applications is translated into a framework for online monitoring that also uses PLTL properties.
Huang et al. [102] present ROSRV, an online monitoring framework that runs on top of ROS. They make use of the publish-subscribe communication architecture and intercept commands and messages passing through the communication channels. This way, they are able to verify safety and security requirements at runtime using a domain-specific language.
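A much-simplified monitor in this spirit can be written against the standard ROS 1 Python API (this sketch uses plain rospy subscription rather than ROSRV's interception mechanism; the topic name and threshold are illustrative):

import rospy
from geometry_msgs.msg import Twist

MAX_SPEED = 1.0  # illustrative safety limit in m/s

def check_command(msg):
    # Flag velocity commands that violate the safety requirement.
    if abs(msg.linear.x) > MAX_SPEED:
        rospy.logwarn("Unsafe velocity command observed: %.2f m/s", msg.linear.x)

if __name__ == "__main__":
    rospy.init_node("safety_monitor")
    rospy.Subscriber("/cmd_vel", Twist, check_command)
    rospy.spin()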
Open-source
Simulation
Manhaes and Rauschenbach present the UUV Simulator [148], an extension of Gazebo accommodating the domain-specific aspects of underwater vehicles. It assists with the modelling of underwater hydrostatic and hydrodynamic effects, thrusters, sensors, and external disturbances; the authors demonstrate the tool on a case using a modified model of the Sperre SF 30k ROV robot (RexROV).
As another tool for underwater robots, MARS [212] provides simulation environments for marine swarm robots and allows for hardware-in-the-loop simulation. The tool has a Java interface and has been applied to the MONSUN and HANSE autonomous underwater robots.
In the Matlab environment, FROST [100] is an open-source Matlab toolkit for modelling, trajectory optimisation, and simulation of robots, with a particular focus on dynamic locomotion. In the study, the authors model the ATLAS and DRC-HUBO robots as examples.
Munawar and Fischer [160] present the Asynchronous Framework, which incorporates real-time dynamic simulation and interfaces with learning agents to train and potentially allow for the execution of shared sub-tasks. Due to the asynchronous nature of the communication, they measure the number of packets against latency. Furthermore, they focus on surgical robots as part of their application domain, and they employ the CHAI3D haptics framework. They connect their tools with ROS, which allows them to connect to learning libraries such as TensorFlow.
D’Urso, Santoro, and Santoro [67] also present a simulator for multi-UAV applications, called GzUAVChannel. It works as a middleware that combines Gazebo, ArduPilot, and the NS-3 network simulator to provide a 3D visualisation engine, a physics simulator, a flight control stack, and a network simulator to handle communications among unmanned aerial vehicles. They model a leader-follower example.
The MoVE tool [54] provides the possibility of modelling pedestrian behaviour. The framework focuses on testing autonomous system algorithms, vehicles, and their interactions with real and simulated vehicles and pedestrians. They conduct three case studies: traffic wave observation, medical evacuation, and virtual vehicles avoiding real pedestrians.
Rohmer, Singh, and Freese introduce V-REP [184], a popular robotics physics simulator that is now known as CoppeliaSim. The tool uses a kinematics engine and several physics libraries to provide rigid-body simulations (including meshes, joints, and multiple types of sensors).
Koolen et al. [117] implement robotic simulation library in the Julia programming language. The library offers support for robot dynamics, visualisation, and control algorithms.
Brambilla et al. have developed ARGOS [37], which is a multi-physics robot simulator that can simulate large-scale swarms and can be customised via plug-ins.
Cieślak introduces Stonefish [52], a geometry-based simulator that can be integrated with ROS. Last, the MARS tool [212] provides simulation environments for marine swarm robots.
Gambi, Mueller, and Fraser present the AsFault prototype tool [78]. The tool combines procedural content generation and search-based testing to automatically create challenging virtual scenarios for testing self-driving car software.
Garzón and Spalanzani [80] present a tool that combines 3D simulation (for ego-vehicle control) with a traffic simulator (which controls the behaviour of other vehicles). The goal is to test the ego-vehicle in realistic high-traffic situations.
Lugo-Cárdenas, Lozano, and Flores [139] introduce a 3D simulation tool for UAVs whose focus is on assisting the development of flight controllers.
Formal Verification
Parametric modelling of CAVs as a network of timed automata is used by Arcile et al. [16]. In this work, the VerifCar tool is applied to assess the impact of communication delays on the decision algorithms of CAVs and to check the robustness and efficiency of such algorithms.
Gruber and Althoff [91] present a reachability analysis tool (Spot) that finds counterexamples witnessing property violations. It starts with a coarse model of the system dynamics but can refine the abstraction level to trade precision against scalability.
Desai et al. [63] present a runtime verification framework (DRONA) based on Signal Temporal Logic [144], where an online monitor checks robustness on partial trajectories from low-level controllers.
RoVer [175] provides visual authoring of human-robot interactions, formalisation of properties in temporal logic, and verification (via model checking with PRISM [123]) that the interactions abide by a set of social norms and task expectations; the goal is to identify violations of social norms.
Althoff introduces IMPROV [6], a tool that is used to formally verify human-robot interaction for modular robots.
Gainer et al. [77] provide a tool (CRutoN) that translates robot control rules written in a DSL into formal models and feeds those models into NuSMV for model checking. Their main emphasis is the verification of human-robot interaction.
Bao et al. [18] present a prototype tool for parametric statistical model checking that can cope with complex parametric Markov chains where state-of-the-art tools (such as PRISM) have timed out. They provide evidence of their tool efficiency by conducting an industrial case study.
The FLPolyVF tool [4] connects functional verification, sensor verification, diagnostics, and industry/regulatory communication of autonomous vehicles while checking the effects of using different (matrix-based) scenario abstraction levels.
Lomuscio et al. have developed the MCMAS model checker [136]. It supports an epistemic extension of alternating-time temporal logic (ATLK) and is one of the few tools encountered that uses a CTL-style branching-time logic. They demonstrate their strategy on a couple of small-scale examples, where they compare it against alternatives (NuSMV and MCTK).
Runtime Monitoring
Huang et al. present ROSRV [102], which is a runtime verification framework that can be used with ROS.
Desai et al. [63] present a runtime verification framework based on Signal Temporal Logic, where an online monitor checks robustness on partial trajectories from low-level controllers (in the context of surgical robots).
Public
Simulation
Cavalcanti et al. [46, 47, 48, 156] introduce RoboTool, supporting graphical modelling, validation, and model checking (via FDR [85]) of robotic models written in RoboChart [156] and RoboSim [48].
Shah et al. [198] introduce the AirSim simulator that generates training data for building machine learning models used in autonomous aircraft. It offers physical and visual simulation, including models of physics engine, vehicle, environment, and sensors. Further, it connects to an API for planning and control.
Zhang et al. [233] introduce CyberEarth, a framework for program-driven simulation, visualisation, and monitoring of robots. The tool integrates modules from several other open-source tools, such as ROS [177] and OpenSceneGraph (OSG) [43].
Model-based Testing
Mullins et al. [159] developed a tool (RAPT - Range Adversarial Planning Tool) for generating test scenarios to be employed on the System Under Test. The tool employs an adaptive search method that generates challenging scenarios based on the performance and results of the previous ones. A clustering algorithm ranks the scenarios based on the performance type and how close they are to the boundaries of each cluster. The boundaries are based on notions of efficiency (precision and convergence), diversity (how many performance boundaries are being covered), and scaling.
Formal Verification
The only tool in this category is RoboTool [48], which has also been described above in the Simulation category. It provides formal verification via translated CSP models fed into the FDR model checker [85].
Private
Simulation
Heitmeyer and Leonard [98] introduce two tools integrated into the FORMAL framework; the tools synthesise and validate formal models. The first tool synthesises a formal Software Cost Reduction (SCR) requirements model from scenarios, and the second tool combines the existing SCR simulator [96] with eBotworks 3D simulator to allow for simulation of continuous components. They focus on the verification of human-machine interaction.
Verma et al. [215] present a Flight Software simulator (SSIM, part of the Rover Sequencing and Visualisation Program (RSVP) suite) that is used to simulate Mars Rover missions. The simulator assists in predicting the behaviour of semi-autonomous systems by allowing human operators to check whether their intent is correctly captured by the robot prior to execution in different scenarios and environments.
Zhang et al. present RoadView [232], a photo-realistic simulator that tests the performance of autonomous vehicles and evaluates their self-driving tasks. They make use of driving scenarios in which they compare an autonomous vehicle to a human-driven scenario to demonstrate their tool.
Schöner presents a simulation tool that is part of the (industrial) Pegasus framework [195]. It integrates sensors, traffic, and road models (in OpenDRIVE format) into the simulation, where different scenarios and situations are executed.
Model-based Testing
Collet et al. [53] introduce RobTest, a tool for generating collision-free trajectories for stress-testing of single-arm robots. It employs constraint programming techniques to solve continuous-domain constraints in its trajectory generation process. The efficiency of this process is evaluated in a controlled experiment measuring the generation time of acceptable near-optimal trajectories.
Formal Verification
O’Kelly introduces APEX [165], which is a formal verification tool for verifying vehicle dynamics, trajectory planning, and tracking stacks of ADAS in vehicles. Property specifications are written in metric interval temporal logic. The tool calls DReach [116] in the background to perform reachability analysis on the vehicle trajectories.
Foughali et al. [75] implement an automatic translation from GenoM [145], a robotics model-based software engineering framework, to the formal specification language Fiacre [28], which can be fed into TINA for model checking (on the Petri net models). They apply their tool to an autonomous ground vehicle (RMP 400 Segway).
Small
Simulation
Regarding small-scale case studies for simulation environments, the vast majority conduct a simple demonstration to illustrate features of their simulation tools [11, 42, 48, 50, 56, 95, 139, 232, 233], such as the MARS [212] (for underwater robots) and V-REP [184] (for generic robots) case studies.
Differently, Li et al. [131] employ a game-theoretical approach where vehicles have different levels of knowledge about other vehicles: a level-0 car has no knowledge about the other cars, while a level-k car has information about level-(k-1) cars. Counter-intuitively, their case study shows that, in some instances, lower-level cars cause fewer constraint violations.
MBT
Two of the case studies in this category consider generating test cases from formal models [132, 163] of autonomous agents. Furthermore, Andrews et al. [10] model autonomous systems and their environment using Petri nets to generate test cases and apply their technique to a case study in the human-robot interaction domain. In Hagerman’s case study [93], finite state machines are used to extract security test suites. Saglietti et al. [190, 191] conduct a case study in which the reconfiguration behaviour of autonomous agents is verified. Betts et al. [30] compare the effectiveness of two search-based testing methods, with a case study involving UAV flight control software.
Formal Verification
Several of the included case studies in this category concern abstract representations of multi-agent autonomous systems and provide efficient algorithms for parametric (formal) verification or state-space reduction techniques [18, 22, 24, 119, 120, 122, 135, 136].
Several other case studies [19, 46, 47, 48, 137, 204] concern systems verified using model-checking tools such as Prism [123] and FDR [85]. Another use of case studies is to demonstrate the usage of newly introduced tools such as APEX [165] and MDE [147]. Differently, Dennis et al. [60, 62] focus on formalising and verifying ethical concerns in BDI agents and provide corresponding small case studies. Aminof et al. [9] investigate the decidability problem for parameterised grid sizes; in their case study, they found that restricting the grid size results in the problem being solvable in PSPACE.
Runtime monitoring
In the only paper in this category, Desai, Tomasso, and Seshia [63] make use of an STL-based (signal temporal logic) online monitoring system to ensure that the assumptions about the low-level controllers (discrete models) used during model checking hold at runtime. They demonstrate the strategy in a surveillance application case study.
Industrial
Simulation
Zhou et al. [237] introduce a framework for virtual testing of advanced driver assistance systems that uses real-world measurements. Shah et al. [198] build a model of a quadrotor with a Pixhawk controller in their newly developed simulator, AirSim, which includes a physics engine and supports real-time hardware-in-the-loop simulation. Schöner presents a simulation tool that is part of the (industrial) Pegasus framework [195]; it integrates sensors, traffic, and road models (in OpenDRIVE format) into the simulation, where different scenarios and situations are executed. Reference [185] demonstrates a validation framework for autonomous systems in a farming context with simulations and field testing.
Uriagereka et al. [214] conduct simulation-assisted fault injection to assess the safety and reliability of robotic systems. The feasibility of their method is demonstrated by applying it to the design of a real-time Cartesian impedance control system. Manhaes and Rauschenbach [148] model the Sperre SF 30k ROV underwater robot (RexROV) in the demonstration of their simulator for unmanned underwater vehicles. Verma et al. [215] present a Flight Software simulator that is used to simulate Mars Rover missions and demonstrate their approach with a corresponding case study. AbdElSalam et al. [1] use hardware emulation-in-the-loop to verify Electronic Control Units (ECUs) for ADAS systems.
MBT
In the only industrial case study in this category, Abdessalem et al. [25] generate test cases for a system that can visually detect pedestrians in the context of advanced driver assistance systems (ADAS).
Formal Verification
Gainer et al. [77] conduct a case study in the context of verification of human-robot interaction using the Care-O-Bot robotic assistant. Bhattacharyya et al. [31] apply formal verification methods to an autonomous CoPilot agent.
Runtime monitoring
Gladisch et al. [87] select case studies that use industrial automated driving (adaptive cruise control, lane keeping, and steering control scenarios) to evaluate their search-based testing strategy.
Benchmarks
Simulation
Several benchmark-scale case studies can be found in this category; in what follows, we briefly discuss some of them. Wigand et al. [224] have developed CoSiMA, an architecture for the simulation, execution, and analysis of robotic systems; they conduct experiments on the humanoid robot COMAN. Tosun et al. [213] present a design framework that facilitates the rapid creation of configurations and behaviours for modular robots, demonstrated on the SMORES robot. Pereira et al. [172] employ several small case studies in their attempt to couple two simulators, namely SUMO and USARSim. Brambilla et al. [37] model a probabilistic swarm that is checked in PRISM to evaluate their property-driven design method. Bohlmann, Klinger, and Szczerbicka [36] automatically generate a model of a parallel delta robot on-the-fly; their method for model generation is based on machine learning and symbiotic simulation techniques. Mutter et al. [161] also explore the simulation of UAV models in Simulink and discuss the results of combining the platform and environment models. Bi et al. [32] present a deep-learning-based framework for traffic simulation and execute several scenarios of intersections with and without pedestrians. D’Urso, Santoro, and Santoro [67] simulate leader-follower UAV scenarios in their framework, whose goal is to combine four simulation environments: a 3D visualisation engine, a physics simulator, a flight control stack, and a network simulator. Wang and Cheng [219] present a hardware-in-the-loop simulator for drones that can generate synthetic images from the scene as datasets, detect and verify objects with a trained neural network, and generate point-cloud data for model validation; they simulate and conduct field testing on a physical UAV. Heitmeyer et al. [98] synthesise Software Cost Reduction models of multiple autonomous systems to be used in a simulator integrated with the eBotworks simulation tool.
MBT
Proetzsch et al. [176] use a purpose-designed DSL (graph-based models) to describe the system behaviour of the autonomous off-road robot RAVON; the model is used as a test model for generating test cases. Mullins et al. [159] have developed a tool (RAPT, the Range Adversarial Planning Tool) for generating test scenarios to be employed on the system under test; their tool is applied to realistic underwater missions. Furthermore, in their case study, Araiza-Illan, Pipe, and Eder [14] use BDI models and model checking of probabilistic timed automata (in UPPAAL) to generate test sequences for human-robot collaboration tasks.
Formal Verification
Here, we briefly discuss some of the benchmarks that involve formal verification. Halder et al. [94] use the physical robot Kobuki as a case study, for which properties are automatically verified using the UPPAAL model checker; the focus of their approach is to model and verify ROS systems with real-time properties. Brambilla et al. [37] model a probabilistic swarm that is checked in PRISM to evaluate their property-driven design method. Bicevskis, Gaujens, and Kalnins [33] develop models for the testing of UAV and UGV collaboration in the Simulink environment. Althoff et al. [6] propose a framework (IMPROV) for self-programming and self-verification for robots, which is demonstrated on a physical robotic arm. Zhao et al. [234] model an unmanned aerial vehicle (UAV) inspection mission on a wind farm and, via probabilistic model checking in PRISM, show how the battery features may affect verification results. Gruber and Althoff [91] present a reachability analysis tool (Spot) that finds counterexamples witnessing property violations; their tool is evaluated using the CommonRoad benchmark PM1:MW1:DEU_Muc-3_1_T-1.
Runtime monitoring
Pasareanu, Gopinath, and Yu [170] present a compositional approach for the verification of autonomous systems and apply the technique to a neural-network implementation of a controller for a collision avoidance system on the ACAS Xu unmanned aircraft. Temporal properties are employed in Wang’s case study [220], where the RoboticSpec specification language for robotic applications is translated into a framework for online monitoring that also uses PLTL properties. Huang et al. conduct a case study using a model of the LandShark UGV to demonstrate their tool, ROSRV [102], a runtime verification framework that can be used with ROS. In the context of obstacle avoidance, Luo et al. [140] employ JavaMOP in their case study to verify that the robot does not behave against requirements written in the FSM and PTLTL languages.
RQ1
By far, quantitative testing techniques are the most widely researched strategies (this was also a common observation for the domain and connectivity aspects).
RQ2
Among the measures used for evaluating interventions, efficiency is most often used, with effectiveness being a close second. Few interventions, however, were evaluated using a notion of coverage [14, 33, 93, 132, 143, 190, 191]. It is notable that, for runtime monitoring, only two publications [87, 102] employ an efficiency metric.
RQ3
There is a considerable lack of tools for model-based testing and runtime monitoring. For simulation and formal verification, in contrast, there is notable strength in terms of tool support.
RQ4
Approximately 54% of the interventions used small-scale case studies for their evaluations, while only 10% evaluated their strategy in an industrial context, indicating a clear gap.

6.2 For Researchers

Throughout the various categories we have coded in this study, the most prominent gap is in the use of agreed-upon rigorous measures to evaluate the efficiency and effectiveness of the interventions, as well as real-world benchmarks that can be used to evaluate such measures. As observed in the earlier sections, most of the efficiency and effectiveness measures used are very generic, and there is a relative gap in domain-specific measures suitable for the RAS sub-domains. A lack of domain-specific modelling languages and the limited number of runtime verification approaches indicate that there is room for improvement in RAS testing strategies.
Another considerable gap is in the use of quantitative specification languages to specify the desired properties of the system; due to the inherent heterogeneity of RAS, we need property languages that cover aspects such as the combination of discrete and continuous dynamics, as well as the stochastic and epistemic aspects that may be used to model the behaviour of the environment and the users. Connected to this point is the relative gap in interventions that perform a quantitative analysis of the system and provide quantitative metrics of quality as the outcome of the test. Some starting points in this direction are the use of quantitative properties that incorporate probabilistic and stochastic [37, 171], timed [94, 165, 201], and continuous dynamical [63] aspects of RAS. We have also noted the use of a specification language that caters for a combination of stochastic and continuous aspects of RAS [137]. In contrast, there is a relative strength in using qualitative models, including property specification languages such as predicate [8] and temporal logics [19, 41, 75, 103, 107, 140, 170, 203], as well as epistemic extensions thereof [121, 134]. Also, there is a wealth of studies on the use of discrete relational [149], state-based [37, 47, 93, 140, 156, 228, 230], and belief-based [13, 14, 107, 163] abstract models in the testing and verification of RAS. Several studies also use informal simulation models for simulation tools such as Gazebo and USARSim [13, 14, 42, 51, 102, 127, 172]. A suitable middle-ground may be semi-formal and domain-specific models such as those built in Matlab/Simulink [25, 30, 161].
Regarding techniques, most of the techniques used so far in the literature are formal verification techniques applied on (relatively high-level) qualitative [8, 41, 71, 121, 134, 147, 187, 228, 230] or quantitative [12, 19, 33, 37, 47, 63, 74, 75, 94, 103, 107, 110, 137, 156, 165, 170, 171, 189, 201, 218, 225] models of RAS. There is also some strength in the use of informal simulation techniques [37, 42, 126, 130, 131, 139, 156, 161, 172, 192, 232]. We have seen relatively few model-based testing [10, 14, 25, 30, 93, 132, 163, 176, 191] and runtime verification [63, 102, 140, 170] techniques applied to (models of) complex and detailed RAS. We hence see a gap, and a trend towards closing this gap, in dynamic and non-exhaustive testing techniques for RAS.
Finally, the lack of public tooling is a major gap observed in the literature. Very few techniques are accompanied by a tool, and there are very few publicly available tools for testing RAS [47, 71, 75, 102, 121, 139, 156, 218, 232].

6.3 For Practitioners

The most significant gap is the lack of industrial evaluation of existing interventions. Very few interventions have been applied in an industrial context and to systems of industrial complexity [1, 2, 25, 31, 70, 77, 87, 105, 113, 148, 153, 185, 194, 195, 197, 198, 208, 214, 215].
Unfortunately, the number of interventions is too small to conclude any meaningful trend and indication of strong evidence for applicability in the industrial setting. Among the proposed interventions, most either concerned simulation-based testing [148, 215] or connected the results of their verification to some simulation tool (mostly based on ROS-Gazebo integration) [1, 70, 153]. Search-based testing [2, 25, 87] and interaction testing [2, 208] are two notable techniques that have been used in industrial contexts. Among the models employed in the industrial context, variants of state machines [215] and fault trees [214] can be mentioned. A notable study in this regard [197] is a comparison of supervisory-control, deductive- and inductive (model-checking) verification techniques in the industrial context.
The human- and information-source aspect of testing interventions is another severely understudied area. We note a recent trend in combining user studies (in the sense of human-computer and human-robot interaction) and traditional testing, validation, and verification techniques [6, 150, 188].
Also, there is a gap in defining and evaluating testing processes, particularly in industrial contexts.
The lack of industrial- and domain-expert input into the models and techniques is evident and has led to generic and relatively simple modelling techniques and property languages being used for most interventions. Co-production with industrial partners can enrich these aspects and lead to models that can deal with the heterogeneity and complexity of industrial RAS.

7 Conclusion

We performed a systematic review of the interventions for testing robotics and autonomous systems to answer the following research questions:
(1)
What are the types of models used for testing RAS?
(2)
Which efficiency and effectiveness measures were introduced or used to evaluate RAS testing interventions?
(3)
What are the interventions supported by (publicly available) tools in this domain?
(4)
Which interventions have evidence of applicability to large-scale and industrial systems?
To this end, we started off by performing a pilot study on a seed of 26 papers. Using this pilot study, we designed and validated a search query, designed rigorous inclusion and exclusion criteria, and developed an adaptation of the SERP-Test taxonomy. Subsequently, we went through two phases of search, validation, and coding, in total going through 10,534 papers. We finally coded the set of 192 included papers and analysed them to answer our research questions.
A summary of the findings of the review with regards to our research questions is provided below:
(1)
There is a wealth of formal and informal models used for testing RAS. In particular, there is a sizeable literature on using generic property specification languages (such as linear temporal logic) and qualitative modelling languages, such as variants of state machines, UML diagrams, Petri nets, and process algebras. There is a clear gap in quantitative modelling languages that can capture the complex and heterogeneous nature of RAS. There is also a lack of domain-specific languages that can capture domain knowledge for various sub-domains of RAS.
(2)
We observed a gap in rigorous and widely accepted metrics to measure the effectiveness, efficiency, and adequacy of testing interventions. Similar to the previous item, the measures used in the literature are very generic and do not pertain to the domain-specific aspects of RAS. Hence, there is a gap and a research opportunity for defining and evaluating rigorous (domain-specific) measures of efficiency, effectiveness, and adequacy for RAS testing interventions.
(3)
A considerable number of interventions rely on public tools for their implementation or evaluation. However, very few make the proposed or evaluated interventions available for public use as publicly available tools. There is hence a considerable gap in providing datasets and public tools for further development of the field.
(4)
There are fewer than a handful of testing interventions that have been evaluated in an industrial context. Some other interventions have used real robots or autonomous systems, but in an academic context. This signifies the importance of future co-production between academia and industry in the industrial evaluation of testing interventions for RAS.

Acknowledgments

We would like to thank Jan Tretmans and Wojciech Mostowski for comments and discussions at the early stage of this research. Moreover, we would like to thank Thomas Arts, Michael Fisher, Mario Gleirscher, Robert Hierons, Fabio Palomba, and Kristin Rozier for their comments at the validation stage of this study.


References

[1]
Mohamed AbdElSalam, Keroles Khalil, John Stickley, Ashraf Salem, and Bruno Loye. 2019. Verification of advanced driver assistance systems (ADAS) and autonomous vehicles with hardware emulation-in-the-loop. Int. J. Automot. Eng. 10, 2 (2019), 197–204.
[2]
Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing autonomous cars for feature interaction failures using many-objective search. In 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 143–154.
[3]
Nauman Bin Ali, Emelie Engström, Masoumeh Taromirad, Mohammad Reza Mousavi, Nasir Mehmood Minhas, Daniel Helgesson, Sebastian Kunze, and Mahsa Varshosaz. 2019. On the search for industry-relevant regression testing research. Empir. Softw. Eng. 24, 4 (2019), 2020–2055.
[4]
Ala Jamil Alnaser, Mustafa Ilhan Akbas, Arman Sargolzaei, and Rahul Razdan. 2019. Autonomous vehicles scenario testing framework and model of computation. SAE Int. J. Connect. Automat. Vehic. 2, 4 (2019), 60617–60628.
[5]
Matthias Althoff and John M. Dolan. 2011. Set-based computation of vehicle behaviors for the online verification of autonomous vehicles. In 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, Washington, DC, 1162–1167.
[6]
Matthias Althoff, Andrea Giusti, Stefan B. Liu, and Aaron Pereira. 2019. Effortless creation of safe robots from modules through self-programming and self-verification. Sci. Robot. 4, 31 (2019), 56–89.
[7]
Saifullah Amin, Adnan Elahi, Kashif Saghar, and Faran Mehmood. 2017. Formal modelling and verification approach for improving probabilistic behaviour of robot swarms. In 14th International Bhurban Conference on Applied Sciences and Technology (IBCAST). IEEE, New York, NY, 392–400.
[8]
Benjamin Aminof, Aniello Murano, Sasha Rubin, and Florian Zuleger. 2015. Verification of asynchronous mobile-robots in partially-known environments. In International Conference on Principles and Practice of Multi-Agent Systems. Springer, 185–200.
[9]
Benjamin Aminof, Aniello Murano, Sasha Rubin, and Florian Zuleger. 2016. Automatic verification of multi-agent systems in parameterised grid-environments. In International Conference on Autonomous Agents & Multiagent Systems. ACM, 1190–1199.
[10]
Anneliese Andrews, Mahmoud Abdelgawad, and Ahmed Gario. 2016. World model for testing urban search and rescue (USAR) robots using petri nets. In 4th International Conference on Model-Driven Engineering and Software Development (MODELSWARD). IEEE, 663–670.
[11]
Vimal Rau Aparow, Apratim Choudary, Giridharan Kulandaivelu, Thomas Webster, Justin Dauwels, and Niels de Boer. 2019. A comprehensive simulation platform for testing autonomous vehicles in 3D virtual environment. In IEEE 5th International Conference on Mechatronics System and Robots (ICMSR). IEEE, 115–119.
[12]
Ryota Arai and H. Schlingloff. 2017. Model-based performance prediction by statistical model checking: An industrial case study of autonomous transport robots. In Concurrency, Specification and Programming Conference. CEUR.
[13]
Dejanira Araiza-Illan, Anthony G. Pipe, and Kerstin Eder. 2016. Intelligent agent-based stimulation for testing robotic software in human-robot interactions. In 3rd Workshop on Model-driven Robot Software Engineering. ACM, 9–16.
[14]
Dejanira Araiza-Illan, Tony Pipe, and Kerstin Eder. 2016. Model-based testing, using belief-desire-intentions agents, of control code for robots in collaborative human-robot interactions. arXiv preprint arXiv:1603.00656 1 (2016).
[15]
Rafael Araújo, Alexandre Mota, and Sidney Nogueira. 2017. Analyzing cleaning robots using probabilistic model checking. In International Conference on Information Reuse and Integration. Springer, 23–51.
[16]
Johan Arcile, Raymond Devillers, and Hanna Klaudel. 2019. VerifCar: A framework for modeling and model checking communicating autonomous vehicles. Auton. Agents Multi-agent syst. 33, 3 (2019), 353–381.
[17]
Radu F. Babiceanu and Remzi Seker. 2017. Formal verification of trustworthiness requirements for small unmanned aerial systems. In Integrated Communications, Navigation and Surveillance Conference (ICNS). IEEE, 6A3–1.
[18]
Ran Bao, Christian Attiogbe, Benoit Delahaye, Paulin Fournier, and Didier Lime. 2019. Parametric statistical model checking of UAV flight plan. In International Conference on Formal Techniques for Distributed Objects, Components, and Systems. Springer, 57–74.
[19]
Benoît Barbot, Béatrice Bérard, Yann Duplouy, and Serge Haddad. 2017. Statistical model-checking for autonomous vehicle safety validation. In Conference SIA Simulation Numérique. HAL-Inria.
[20]
Halil Beglerovic, Steffen Metzner, and Martin Horn. 2018. Challenges for the Validation and Testing of Automated Driving Functions. (Jan. 2018).
[21]
Michael Behrisch, Laura Bieker, Jakob Erdmann, and Daniel Krajzewicz. 2011. SUMO-Simulation of urban mobility: An overview. In 3rd International Conference on Advances in System Simulation. ThinkMind.
[22]
Francesco Belardinelli, Panagiotis Kouvaros, and Alessio Lomuscio. 2017. Parameterised verification of data-aware multi-agent systems. In International Joint Conference on Artificial Intelligence. ACM, 98–104.
[23]
Francesco Belardinelli, Alessio Lomuscio, Aniello Murano, and Sasha Rubin. 2017. Verification of broadcasting multi-agent systems against an epistemic strategy logic. In International Joint Conference on Artificial Intelligence. ACM, Melbourne, Australia, 91–97.
[24]
Francesco Belardinelli, Alessio Lomuscio, Aniello Murano, and Sasha Rubin. 2017. Verification of multi-agent systems with imperfect information and public actions. In International Conference on Autonomous Agents and Multiagent Systems. ACM, 1268–1276.
[25]
Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2016. Testing advanced driver assistance systems using multi-objective search and neural networks. In 31st IEEE/ACM International Conference on Automated Software Engineering. IEEE/ACM, New York, NY, 63–74.
[26]
Luca Benvenuti, Davide Bresolin, Pieter Collins, Alberto Ferrari, Luca Geretti, and Tiziano Villa. 2014. Assume–guarantee verification of nonlinear hybrid systems with Ariadne. Int. J. Robust Nonlin. Contr. 24, 4 (2014), 699–724.
[27]
Christian Berger. 2015. Accelerating regression testing for scaled self-driving cars with lightweight virtualization-A case study. In IEEE/ACM 1st International Workshop on Software Engineering for Smart Cyber-physical Systems. IEEE, 2–7.
[28]
Bernard Berthomieu, Jean-Paul Bodeveix, Patrick Farail, Mamoun Filali, Hubert Garavel, Pierre Gaufillet, Frederic Lang, and François Vernadat. 2008. Fiacre: An intermediate language for model verification in the topcased environment. In 4th European Congress ERTS Embedded Real Time Software (ERTS’08). SEE.
[29]
Bernard Berthomieu, P.-O. Ribet, and François Vernadat. 2004. The tool TINA–construction of abstract state spaces for Petri nets and time Petri nets. Int. J. Product. Res. 42, 14 (2004), 2741–2756.
[30]
Kevin M. Betts and Mikel D. Petty. 2016. Automated search-based robustness testing for autonomous vehicle software. Model. Simul. Eng. 2016 (2016).
[31]
Siddhartha Bhattacharyya, Thomas C. Eskridge, Natasha A. Neogi, Marco Carvalho, and Milton Stafford. 2018. Formal assurance for cooperative intelligent autonomous agents. In NASA Formal Methods Symposium. Springer, 20–36.
[32]
Huikun Bi, Tianlu Mao, Zhaoqi Wang, and Zhigang Deng. 2019. A deep learning-based framework for intersectional traffic simulation and editing. IEEE Trans. Visualiz. Comput. Graph. 1 (2019).
[33]
Janis Bicevskis, Artis Gaujens, and Janis Kalnins. 2013. Testing of RUAV and UGV robots’ collaboration in the Simulink environment. Balt. J. Mod. Comput. 1 (2013).
[34]
Andreas Bihlmaier and Heinz Wörn. 2014. Robot unit testing. In International Conference on Simulation, Modeling, and Programming for Autonomous Robots. Springer, 255–266.
[35]
Eckard Böde, Matthias Büker, Ulrich Eberle, Martin Fränzle, Sebastian Gerwinn, and Birte Kramer. 2018. Efficient splitting of test and simulation cases for the verification of highly automated driving functions. In International Conference on Computer Safety, Reliability, and Security. Springer, 139–153.
[36]
Sebastian Bohlmann, Volkhard Klinger, and Helena Szczerbicka. 2017. Integration of a physical system, machine learning, simulation, validation and control systems towards symbiotic model engineering. In Symposium on Modeling and Simulation of Complexity in Intelligent, Adaptive and Autonomous Systems. ACM, 1–12.
[37]
Manuele Brambilla, Arne Brutschy, Marco Dorigo, and Mauro Birattari. 2014. Property-driven design for robot swarms: A design method based on prescriptive modeling and model checking. ACM Trans. Auton. Adapt. Syst. 9, 4 (2014), 1–28.
[38]
Paul Bremner, Louise A. Dennis, Michael Fisher, and Alan F. Winfield. 2019. On proactive, transparent, and verifiable ethical reasoning for robots. Proc. IEEE 107, 3 (2019), 541–561.
[39]
Davide Bresolin, Luca Geretti, Riccardo Muradore, Paolo Fiorini, and Tiziano Villa. 2015. Formal verification applied to robotic surgery. In Coordination Control of Distributed Systems. Springer, 347–355.
[40]
Davide Bresolin, Luca Geretti, Riccardo Muradore, Paolo Fiorini, and Tiziano Villa. 2015. Formal verification of robotic surgery tasks by reachability analysis. Microproc. Microsyst. 39, 8 (2015), 836–842.
[41]
Julien Brunel and Jacques Cazin. 2012. Formal verification of a safety argumentation and application to a complex UAV system. In International Conference on Computer Safety, Reliability, and Security. Springer, 307–318.
[42]
Qing Bu, Fuhua Wan, Zhen Xie, Qinhu Ren, Jianhua Zhang, and Sheng Liu. 2015. General simulation platform for vision based UAV testing. In IEEE International Conference on Information and Automation. IEEE, 2512–2516.
[43]
Don Burns and Robert Osfield. 2004. Tutorial: Open scene graph A: Introduction tutorial: Open scene graph B: Examples and applications. In IEEE Virtual Reality Conference. IEEE, 265–265.
[44]
IPG CarMaker. 2014. User's Guide Version 4.5.2. IPG Automotive, Karlsruhe, Germany.
[45]
Stefano Carpin, Mike Lewis, Jijun Wang, Stephen Balakirsky, and Chris Scrapper. 2007. USARSim: A robot simulator for research and education. In Proceedings IEEE International Conference on Robotics and Automation. IEEE, 1400–1405.
[46]
Ana Cavalcanti, James Baxter, Robert M. Hierons, and Raluca Lefticaru. 2019. Testing robots using CSP. In International Conference on Tests and Proofs. Springer, 21–38.
[47]
Ana Cavalcanti, Alvaro Miyazawa, Augusto Sampaio, Wei Li, Pedro Ribeiro, and Jon Timmis. 2018. Modelling and verification for swarm robotics. In International Conference on Integrated Formal Methods. Springer, 1–19.
[48]
Ana Cavalcanti, Augusto Sampaio, Alvaro Miyazawa, Pedro Ribeiro, Madiel Conserva Filho, André Didier, Wei Li, and Jon Timmis. 2019. Verified simulation for robotics. Sci. Comput. Program. 174 (2019), 1–37.
[49]
Qianwen Chao, Xiaogang Jin, Hen-Wei Huang, Shaohui Foong, Lap-Fai Yu, and Sai-Kit Yeung. 2019. Force-based heterogeneous traffic simulation for autonomous vehicle testing. In International Conference on Robotics and Automation (ICRA). IEEE, 8298–8304.
[50]
Shitao Chen, Yu Chen, Songyi Zhang, and Nanning Zheng. 2019. A novel integrated simulation and testing platform for self-driving cars with hardware in the loop. IEEE Trans. Intell. Vehic. 4, 3 (2019), 425–436.
[51]
Yu Chen, Shitao Chen, Tangyike Zhang, Songyi Zhang, and Nanning Zheng. 2018. Autonomous vehicle testing and validation platform: Integrated simulation system with hardware in the loop. In IEEE Intelligent Vehicles Symposium. IEEE, 949–956.
[52]
Patryk Cieślak. 2019. Stonefish: An advanced open-source simulation tool designed for marine robotics, with a ROS interface. In OCEANS Conference. IEEE, 1–6.
[53]
Mathieu Collet, Arnaud Gotlieb, Nadjib Lazaar, and Morten Mossige. 2019. Stress testing of single-arm robots through constraint-based generation of continuous trajectories. In IEEE International Conference on Artificial Intelligence Testing (AITest). IEEE, 121–128.
[54]
Marc Compere, Garrett Holden, Otto Legon, and Roberto Martinez Cruz. 2019. MoVE: A mobility virtual environment for autonomous vehicle testing. In ASME International Mechanical Engineering Congress and Exposition. American Society of Mechanical Engineers.
[55]
A. Cortesi, P. Ferrara, and N. Chaki. 2013. Static analysis techniques for robotics software verification. In IEEE 44th International Symposium on Robotics. IEEE, 1–6.
[56]
Piotr Cybulski. 2019. A framework for autonomous UAV swarm behavior simulation. In Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 471–478.
[57]
Werner Damm and Roland Galbas. 2018. Exploiting learning and scenario-based specification languages for the verification and validation of highly automated driving. In IEEE/ACM 1st International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS). IEEE, 39–46.
[58]
Vânia de Oliveira Neves, Márcio Eduardo Delamaro, and Paulo Cesar Masiero. 2019. Automated structural software testing of autonomous vehicles. In 20th Ibero-American Conference on Software Engineering. CIbSE.
[59]
Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolut. Computat. 6, 2 (2002), 182–197.
[60]
Louise Dennis, Michael Fisher, Marija Slavkovik, and Matt Webster. 2016. Formal verification of ethical choices in autonomous systems. Robot. Auton. Syst. 77 (2016), 1–14.
[61]
Louise A. Dennis. 2018. The MCAPL framework including the agent infrastructure layer and agent Java Pathfinder. J. Open Source Softw. 3, 24 (2018), 617.
[62]
Louise A. Dennis, Michael Fisher, and Alan F. T. Winfield. 2015. Towards verifiably ethical robot behaviour. In Workshops at the 29th AAAI Conference On Artificial Intelligence (AAAI’15). IEEE.
[63]
Ankush Desai, Tommaso Dreossi, and Sanjit A. Seshia. 2017. Combining model checking and runtime verification for safe robotics. In Runtime Verification, Shuvendu Lahiri and Giles Reger (Eds.). Springer International Publishing, Cham, 172–189.
[64]
Ha Thi Thu Doan, François Bonnet, and Kazuhiro Ogata. 2018. Model checking of robot gathering. In 21st International Conference on Principles of Distributed Systems (OPODIS’17). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[65]
Simulink Documentation. 2020. Simulation and Model-based Design. Retrieved from https://www.mathworks.com/products/simulink.html.
[66]
Daniela Doroftei, Anibal Matos, Eduardo Silva, Victor Lobo, Rene Wagemans, and Geert De Cubber. 2015. Operational validation of robots for risky environments. In 8th IARP Workshop on Robotics for Risky Environments. IARP/EURON.
[67]
Fabio D’Urso, Corrado Santoro, and Federico Fausto Santoro. 2019. An integrated framework for the realistic simulation of multi-UAV applications. Comput. Electric. Eng. 74 (2019), 196–209.
[68]
Emelie Engström, Kai Petersen, Nauman bin Ali, and Elizabeth Bjarnason. 2017. SERP-test: A taxonomy for supporting industry–academia communication. Softw. Qual. J. 25 (2017), 1269–1305.
[69]
Giuseppe Airò Farulla and Anna-Lena Lamprecht. 2017. Model checking of security properties: A case study on human-robot interaction processes. In 12th International Conference on Design & Technology of Integrated Systems In Nanoscale Era (DTIS). IEEE, 1–6.
[70]
S. Alireza Fayazi, Ardalan Vahidi, and Andre Luckow. 2019. A vehicle-in-the-loop (VIL) verification of an all-autonomous intersection control scheme. Transport. Res. Part C: Emerg. Technol. 107 (2019), 193–210.
[71]
Lucas E. R. Fernandes, Vinicius Custodio, Gleifer V. Alves, and Michael Fisher. 2017. A rational agent controlling an autonomous vehicle: Implementation and formal verification. arXiv preprint arXiv:1709.02557 (2017).
[72]
Angelo Ferrando, Louise A. Dennis, Davide Ancona, Michael Fisher, and Viviana Mascardi. 2018. Verifying and validating autonomous systems: Towards an integrated approach. In International Conference on Runtime Verification. Springer, 263–281.
[73]
Mohammed Foughali. 2019. On reconciling schedulability analysis and model checking in robotics. In International Conference on Model and Data Engineering. Springer, 32–48.
[74]
Mohammed Foughali, Bernard Berthomieu, Silvano Dal Zilio, Pierre-Emmanuel Hladik, Félix Ingrand, and Anthony Mallet. 2018. Formal verification of complex robotic systems on resource-constrained platforms. In IEEE/ACM 6th International FME Workshop on Formal Methods in Software Engineering (FormaliSE). IEEE, 2–9.
[75]
Mohammed Foughali, Bernard Berthomieu, Silvano Dal Zilio, Félix Ingrand, and Anthony Mallet. 2016. Model checking real-time properties on the functional layer of autonomous robots. In International Conference on Formal Engineering Methods. Springer, 383–399.
[76]
National Science Foundation. 2018. Smart and Autonomous Systems (S&AS) Program Solicitation. Retrieved from https://www.nsf.gov/pubs/2018/nsf18557/nsf18557.htm.
[77]
Paul Gainer, Clare Dixon, Kerstin Dautenhahn, Michael Fisher, Ullrich Hustadt, Joe Saunders, and Matt Webster. 2017. CRutoN: Automatic verification of a robotic assistant’s behaviours. In International Workshop on Formal Methods and Automated Verification of Critical Systems. Springer, 119–133.
[78]
Alessio Gambi, Marc Mueller, and Gordon Fraser. 2019. Automatically testing self-driving cars with search-based procedural content generation. In 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 318–328.
[79]
Shenjian Gao and Yanwen Tan. 2017. Paving the Way for Self-driving Cars - Software Testing for Safety-critical Systems Based on Machine Learning: A Systematic Mapping Study and a Survey. Blekinge Tekniska Högskola.
[80]
Mario Garzón and Anne Spalanzani. 2018. An hybrid simulation tool for autonomous cars in very high traffic scenarios. In 15th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 803–808.
[81]
Lydia Gauerhof, Peter Munk, and Simon Burton. 2018. Structuring validation targets of a machine learning function applied to automated driving. In International Conference on Computer Safety, Reliability, and Security. Springer, 45–58.
[82]
Luca Geretti, Riccardo Muradore, Davide Bresolin, Paolo Fiorini, and Tiziano Villa. 2017. Parametric formal verification: The robotic paint spraying case study. IFAC-PapersOnLine 50, 1 (2017), 9248–9253.
[83]
Achim Gerstenberg and Martin Steinert. 2019. Evaluating and optimizing chaotically behaving mobile robots with a deterministic simulation. Procedia CIRP 84 (2019), 219–224.
[84]
Carlo Ghezzi, Dino Mandrioli, and Angelo Morzenti. 1990. TRIO: A logic language for executable specifications of real-time systems. J. Syst. Softw. 12, 2 (1990), 107–123.
[85]
Thomas Gibson-Robinson, Philip Armstrong, Alexandre Boulgakov, and Andrew W. Roscoe. 2014. FDR3-A modern refinement checker for CSP. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 187–201.
[86]
Edmond Gjondrekaj, Michele Loreti, Rosario Pugliese, Francesco Tiezzi, Carlo Pinciroli, Manuele Brambilla, Mauro Birattari, and Marco Dorigo. 2012. Towards a formal verification methodology for collective robotic systems. In International Conference on Formal Engineering Methods. Springer, 54–70.
[87]
Christoph Gladisch, Thomas Heinz, Christian Heinzemann, Jens Oehlerking, Anne von Vietinghoff, and Tim Pfitzer. 2019. Experience paper: Search-based testing in automated driving control applications. In 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 26–37.
[88]
Mario Gleirscher, Simon Foster, and Yakoub Nemouchi. 2019. Evolution of formal model-based assurance cases for autonomous robots. In International Conference on Software Engineering and Formal Methods. Springer, 87–104.
[89]
Mario Gleirscher, Simon Foster, and Jim Woodcock. 2020. New opportunities for integrated formal methods. ACM Comput. Surv. 52, 6 (2020), 117:1–117:36.
[90]
João S. V. Gonçalves, João Jacob, Rosaldo J. F. Rossetti, António Coelho, and Rui Rodrigues. 2015. An integrated framework for mobile-based ADAS simulation. In Modeling Mobility with Open Data. Springer, Berlin, Germany, 171–186.
[91]
Felix Gruber and Matthias Althoff. 2018. Anytime safety verification of autonomous vehicles. In 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 1708–1714.
[92]
K. M. Gupta and K. Gillespie. 2015. eBotworks: A software platform for developing and evaluating communicative autonomous systems. In AUVSI Unmanned Systems, Atlanta, GA.
[93]
Seana Hagerman, Anneliese Andrews, and Stephen Oakes. 2016. Security testing of an unmanned aerial vehicle (UAV). In Cybersecurity Symposium (CYBERSEC). IEEE, 26–31.
[94]
Raju Halder, José Proença, Nuno Macedo, and André Santos. 2017. Formal verification of ROS-based robotic applications using timed-automata. In IEEE/ACM 5th International FME Workshop on Formal Methods in Software Engineering (FormaliSE). IEEE, 44–50.
[95]
Jani Erik Heikkinen, Salimzhan Gafurov, Sergey Kopylov, Tatiana Minav, Sergey Grebennikov, and Artur Kurbanov. 2019. Hardware-in-the-loop platform for testing autonomous vehicle control algorithms. In 12th International Conference on Developments in eSystems Engineering (DeSE). IEEE, 906–911.
[96]
Constance Heitmeyer, Myla Archer, Ramesh Bharadwaj, and Ralph Jeffords. 2005. Tools for Constructing Requirements Specification: The SCR Toolset at the Age of Ten. Technical Report. Naval Research Lab Washington DC Center for High Assurance Computing Systems.
[97]
Constance L. Heitmeyer. 2002. Software cost reduction. Encyc. Softw. Eng. 1 (2002).
[98]
Constance L. Heitmeyer and Elizabeth I. Leonard. 2015. Obtaining trust in autonomous systems: Tools for formal model synthesis and validation. In IEEE/ACM 3rd FME Workshop on Formal Methods in Software Engineering. IEEE, 54–60.
[99]
Philipp Helle, Wladimir Schamai, and Carsten Strobel. 2016. Testing of autonomous systems—Challenges and current state-of-the-art. INCOSE Int. Sympos. 26, 1 (2016), 571–584.
[100]
Ayonga Hereid and Aaron D. Ames. 2017. FROST*: Fast robot optimization and simulation toolkit. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 719–726.
[101]
Charles Antony Richard Hoare. 1978. Communicating sequential processes. Commun. ACM 21, 8 (1978), 666–677.
[102]
Jeff Huang, Cansu Erdogan, Yi Zhang, Brandon Moore, Qingzhou Luo, Aravind Sundaresan, and Grigore Rosu. 2014. ROSRV: Runtime verification for robots. In International Conference on Runtime Verification. Springer, 247–254.
[103]
Laura R. Humphrey. 2013. Model checking for verification in UAV cooperative control applications. Rec. Adv. Res. Unman. Aer. Vehic. 444 (2013), 69–117.
[104]
David Husch and John Albeck. 2004. Trafficware SYNCHRO 6 User Guide. Trafficware, Albany, CA.
[105]
Adam Jacoff, Hui-Min Huang, Elena Messina, Ann Virts, and Anthony Downs. 2010. Comprehensive standard test suites for the performance evaluation of mobile robots. In 10th Performance Metrics for Intelligent Systems Workshop. ACM, 161–168.
[106]
Andreas Junghanns, Jakob Mauss, Mugur Tatar, et al. 2008. TestWeaver—A tool for simulation-based test of mechatronic designs. In 6th International Modelica Conference. Citeseer.
[107]
Maryam Kamali, Louise A. Dennis, Owen McAree, Michael Fisher, and Sandor M. Veres. 2017. Formal verification of autonomous vehicle platooning. Sci. Comput. Program. 148 (2017), 88–106.
[108]
Y. Kang, H. Yin, and C. Berger. 2019. Test your self-driving algorithm: An overview of publicly available driving datasets and virtual testing environments. IEEE Trans. Intell. Vehic. 4, 2 (2019), 171–185.
[109]
Shinpei Kato, Shota Tokunaga, Yuya Maruyama, Seiya Maeda, Manato Hirabayashi, Yuki Kitsukawa, Abraham Monrroy, Tomohito Ando, Yusuke Fujii, and Takuya Azumi. 2018. Autoware on board: Enabling autonomous vehicles with embedded systems. In ACM/IEEE 9th International Conference on Cyber-Physical Systems (ICCPS). IEEE, 287–296.
[110]
Hojat Khosrowjerdi and Karl Meinke. 2018. Learning-based testing for autonomous systems using spatial and temporal requirements. In 1st International Workshop on Machine Learning and Software Engineering in Symbiosis. ACM, 6–15.
[111]
Baekgyu Kim, Yusuke Kashiba, Siyuan Dai, and Shinichi Shiraishi. 2016. Testing autonomous vehicle software in the virtual prototyping environment. IEEE Embed. Syst. Lett. 9, 1 (2016), 5–8.
[112]
Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1039–1049.
[113]
Florian Klück, Martin Zimmermann, Franz Wotawa, and Mihai Nica. 2019. Genetic algorithm-based test parameter optimization for ADAS system testing. In IEEE 19th International Conference on Software Quality, Reliability and Security (QRS). IEEE, 418–425.
[114]
A. Knauss, J. Schroder, C. Berger, and H. Eriksson. 2017. Software-related challenges of testing automated vehicles. In IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE, 328–330.
[115]
Nathan Koenig and Andrew Howard. 2004. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2149–2154.
[116]
Soonho Kong, Sicun Gao, Wei Chen, and Edmund Clarke. 2015. dReach: δ-reachability analysis for hybrid systems. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 200–205.
[117]
Twan Koolen and Robin Deits. 2019. Julia for robotics: Simulation and real-time control in a high-level programming language. In International Conference on Robotics and Automation (ICRA). IEEE, 604–611.
[118]
Philip Koopman and Michael Wagner. 2016. Challenges in autonomous vehicle testing and validation. SAE Int. J. Transport. Saf. 4, 1 (2016), 15–24.
[119]
Panagiotis Kouvaros and Alessio Lomuscio. 2015. A counter abstraction technique for the verification of robot swarms. In AAAI Conference on Artificial Intelligence. AAAI PRESS.
[120]
Panagiotis Kouvaros and Alessio Lomuscio. 2015. Verifying emergent properties of swarms. In 24th International Joint Conference on Artificial Intelligence. AAAI Press.
[121]
Panagiotis Kouvaros and Alessio Lomuscio. 2016. Formal verification of opinion formation in swarms. In International Conference on Autonomous Agents & Multiagent Systems. ACM, 1200–1208.
[122]
Panagiotis Kouvaros, Alessio Lomuscio, Edoardo Pirovano, and Hashan Punchihewa. 2019. Formal verification of open multi-agent systems. In 18th International Conference on Autonomous Agents and MultiAgent Systems. ACM, 179–187.
[123]
Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of probabilistic real-time systems. In International Conference on Computer-aided Verification. Springer, 585–591.
[124]
John E. Laird. 2019. The Soar cognitive architecture. MIT Press.
[125]
Kim G. Larsen, Paul Pettersson, and Wang Yi. 1997. UPPAAL in a nutshell. Int. J. Software Tools Technol. Transf. 1, 1-2 (1997), 134–152.
[126]
Adrien Lasbouygues, Benoit Ropars, Robin Passama, David Andreu, and Lionel Lapierre. 2015. Atoms based control of mobile robots with hardware-in-the-loop validation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1083–1090.
[127]
Jannik Laval, Luc Fabresse, and Noury Bouraqadi. 2013. A methodology for testing mobile autonomous robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1842–1847.
[128]
Philippe Ledent, Anshul Paigwar, Alessandro Renzaglia, Radu Mateescu, and Christian Laugier. 2019. Formal validation of probabilistic collision risk estimation for autonomous driving. In IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM). IEEE, 433–438.
[129]
Li Li, Wu-Ling Huang, Yuehu Liu, Nan-Ning Zheng, and Fei-Yue Wang. 2016. Intelligence testing for autonomous vehicles: A new approach. IEEE Trans. Intell. Vehic. 1, 2 (2016), 158–166.
[130]
Nan Li, Dave Oyler, Mengxuan Zhang, Yildiray Yildiz, Anouck Girard, and Ilya Kolmanovsky. 2016. Hierarchical reasoning game theory based approach for evaluation and testing of autonomous vehicle control systems. In IEEE 55th Conference on Decision and Control (CDC). IEEE, 727–733.
[131]
Nan Li, Dave W. Oyler, Mengxuan Zhang, Yildiray Yildiz, Ilya Kolmanovsky, and Anouck R. Girard. 2017. Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems. IEEE Trans. Contr. Syst. Technol. 26, 5 (2017), 1782–1797.
[132]
Raimar Lill and Francesca Saglietti. 2014. Testing the cooperation of autonomous robotic agents. In 9th International Conference on Software Engineering and Applications (ICSOFT-EA). IEEE, 287–296.
[133]
Mikael Lindvall, Adam Porter, Gudjon Magnusson, and Christoph Schulze. 2017. Metamorphic model-based testing of autonomous systems. In IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET). IEEE, 35–41.
[134]
Alessio Lomuscio and Jakub Michaliszyn. 2015. Verifying multi-agent systems by model checking three-valued abstractions. In International Conference on Autonomous Agents and Multiagent Systems. ACM, 189–198.
[135]
Alessio Lomuscio and Edoardo Pirovano. 2019. A counter abstraction technique for the verification of probabilistic swarm systems. In International Conference on Autonomous Agents and Multiagent Systems. ACM, 161–169.
[136]
Alessio Lomuscio, Hongyang Qu, and Franco Raimondi. 2017. MCMAS: An open-source model checker for the verification of multi-agent systems. Int. J. Softw. Tools Technol. Transf. 19, 1 (2017), 9–30.
[137]
Yu Lu, Hanlin Niu, Al Savvaris, and Antonios Tsourdos. 2016. Verifying collision avoidance behaviours for unmanned surface vehicles using probabilistic model checking. IFAC-PapersOnLine 49, 23 (2016), 127–132.
[138]
Matt Luckcuck, Marie Farrell, Louise A. Dennis, Clare Dixon, and Michael Fisher. 2019. Formal specification and verification of autonomous robotic systems: A survey. ACM Comput. Surv. 52, 5 (2019), 100:1–100:41.
[139]
Israel Lugo-Cárdenas, Gerardo Flores, and Rogelio Lozano. 2014. The MAV3DSim: A simulation platform for research, education and validation of UAV controllers. IFAC Proc. 47, 3 (2014), 713–717.
[140]
Chenxia Luo, Rui Wang, Yu Jiang, Kang Yang, Yong Guan, Xiaojuan Li, and Zhiping Shi. 2018. Runtime verification of robots collision avoidance case study. In IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC). IEEE, 204–212.
[141]
Damian M. Lyons, Ronald C. Arkin, Shu Jiang, Dagan Harrington, Feng Tang, and Peng Tang. 2015. Probabilistic verification of multi-robot missions in uncertain environments. In IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 56–63.
[142]
Damian M. Lyons, Ronald C. Arkin, Shu Jiang, Matt O’Brien, Feng Tang, and Peng Tang. 2017. Performance verification for robot missions in uncertain environments. Robot. Auton. Syst. 98 (2017), 89–104.
[143]
István Majzik, Oszkár Semeráth, Csaba Hajdu, Kristóf Marussy, Zoltán Szatmári, Zoltán Micskei, András Vörös, Aren A. Babikian, and Dániel Varró. 2019. Towards system-level testing with coverage guarantees for autonomous vehicles. In ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS). IEEE, 89–94.
[144]
Oded Maler and Dejan Nickovic. 2004. Monitoring temporal properties of continuous signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-tolerant Systems. Springer, 152–166.
[145]
Anthony Mallet, Cédric Pasteur, Matthieu Herrb, Séverin Lemaignan, and Félix Ingrand. 2010. GenoM3: Building middleware-independent robotic components. In IEEE International Conference on Robotics and Automation. IEEE, 4627–4632.
[146]
Michel Mamrot, Stefan Marchlewitz, Jan-Peter Nicklas, Petra Winzer, Thomas Tetzlaff, Philipp Kemper, and Ulf Witkowski. 2015. Model-based test and validation support for autonomous mechatronic systems. In IEEE International Conference on Systems, Man, and Cybernetics. IEEE, Hong Kong, 701–706.
[147]
Md Abdullah Al Mamun, Christian Berger, and Jorgen Hansson. 2013. MDE-based sensor management and verification for a self-driving miniature vehicle. In ACM Workshop on Domain-specific Modeling. ACM, 1–6.
[148]
Musa Morena Marcusso Manhães, Sebastian A. Scherer, Martin Voss, Luiz Ricardo Douat, and Thomas Rauschenbach. 2016. UUV simulator: A Gazebo-based package for underwater intervention and multi-robot simulation. In MTS/IEEE OCEANS Conference. IEEE, 1–8.
[149]
Niloofar Mansoor, Jonathan A. Saddler, Bruno Silva, Hamid Bagheri, Myra B. Cohen, and Shane Farritor. 2018. Modeling and testing a family of surgical robots: An experience report. In 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 785–790.
[150]
Casper Sloth Mariager, Daniel Kjaer Bonde Fischer, Jakob Kristiansen, and Matthias Rehm. 2019. Co-designing and field-testing adaptable robots for triggering positive social interactions for adolescents with cerebral palsy. In 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 1–6.
[151]
MATLAB. 2010. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts.
[152]
John Alexander McDermid, Yan Jia, and Ibrahim Habli. 2019. Towards a framework for safety assurance of autonomous systems. In Artificial Intelligence Safety Conference. CEUR Workshop Proceedings, 1–7.
[153]
Steve McGuire, P. Michael Furlong, Terry Fong, Christoffer Heckman, Daniel Szafir, Simon J. Julier, and Nisar Ahmed. 2019. Everybody needs somebody sometimes: Validation of adaptive recovery in robotic space operations. IEEE Robot. Automat. Lett. 4, 2 (2019), 1216–1223.
[154]
David R. MacIver. 2021. Hypothesis. Retrieved from https://github.com/HypothesisWorks/hypothesis.
[155]
Christopher Medrano-Berumen and Mustafa Ilhan Akbaş. 2019. Abstract simulation scenario generation for autonomous vehicle verification. In SoutheastCon. IEEE, 1–6.
[156]
Alvaro Miyazawa, Pedro Ribeiro, Wei Li, A. L. C. Cavalcanti, Jon Timmis, and J. C. P. Woodcock. 2016. RoboChart: A state-machine notation for modelling and verification of mobile and autonomous robots. (2016).
[157]
Maurizio Mongelli, Marco Muselli, Andrea Scorzoni, and Enrico Ferrari. 2019. Accelerating PRISM validation of vehicle platooning through machine learning. In 4th International Conference on System Reliability and Safety (ICSRS). IEEE, 452–456.
[158]
Shahabuddin Muhammad, Nazeeruddin Mohammad, Abul Bashar, and Majid Ali Khan. 2019. Designing human assisted wireless sensor and robot networks using probabilistic model checking. J. Intell. Robot. Syst. 94, 3-4 (2019), 687–709.
[159]
Galen E. Mullins, Paul G. Stankiewicz, R. Chad Hawthorne, and Satyandra K. Gupta. 2018. Adaptive generation of challenging scenarios for testing and evaluation of autonomous vehicles. J. Syst. Softw. 137 (2018), 197–215.
[160]
Adnan Munawar and Gregory S. Fischer. 2019. An asynchronous multi-body simulation framework for real-time dynamics, haptics and learning with application to surgical robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.
[161]
Florian Mutter, Stefanie Gareis, Bernhard Schätz, Andreas Bayha, Franziska Grüneis, Michael Kanis, and Dagmar Koss. 2011. Model-driven in-the-loop validation: Simulation-based testing of UAV software using virtual environments. In 18th IEEE International Conference and Workshops on Engineering of Computer-based Systems. IEEE, 269–275.
[162]
Frederik Naujoks, Sebastian Hergeth, Katharina Wiedemann, Nadja Schömig, and Andreas Keinath. 2018. Use cases for assessing, testing, and validating the human machine interface of automated driving systems. In Human Factors and Ergonomics Society Annual Meeting. SAGE Publications, 1873–1877.
[163]
Cu D. Nguyen, Simon Miles, Anna Perini, Paolo Tonella, Mark Harman, and Michael Luck. 2012. Evolutionary testing of autonomous software agents. Auton. Agents Multi-agent Syst. 25, 2 (2012), 260–283.
[164]
Royal Academy of Engineering. 2015. Innovation in autonomous systems: Summary of an event held on Monday 22 June 2015 at the Royal Academy of Engineering.
[165]
Matthew O’Kelly, Houssam Abbas, Sicun Gao, Shin’ichi Shiraishi, Shinpei Kato, and Rahul Mangharam. 2016. APEX: Autonomous vehicle plan verification and execution. In SAE World Congress and Exhibition. SAE International.
[166]
Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, and John C. Duchi. 2018. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Conference on Advances in Neural Information Processing Systems. NeurIPS, 9827–9838.
[167]
Stephan Opfer, Stefan Niemczyk, and Kurt Geihs. 2016. Multi-agent plan verification with answer set programming. In 3rd Workshop on Model-driven Robot Software Engineering. ACM, 32–39.
[168]
Matthew O’Brien, Ronald C. Arkin, Dagan Harrington, Damian Lyons, and Shu Jiang. 2014. Automatic verification of autonomous robot missions. In International Conference on Simulation, Modeling, and Programming for Autonomous Robots. Springer, 462–473.
[169]
Jisun Park, Mingyun Wen, Yunsick Sung, and Kyungeun Cho. 2019. Multiple event-based simulation scenario generation approach for autonomous vehicle smart sensors and devices. Sensors 19, 20 (2019), 4456.
[170]
Corina S. Pasareanu, Divya Gopinath, and Huafeng Yu. 2018. Compositional verification for autonomous systems with deep learning components. arXiv preprint arXiv:1810.08303 (2018).
[171]
Shashank Pathak, Giorgio Metta, and Armando Tacchella. 2014. Is verification a requisite for safe adaptive robots? In IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 3399–3402.
[172]
José L. F. Pereira and Rosaldo J. F. Rossetti. 2012. An integrated architecture for autonomous vehicles simulation. In 27th Annual ACM Symposium on Applied Computing. ACM, 286–292.
[173]
Mauro Pezzè and Michal Young. 2007. Software Testing and Analysis: Process, Principles, and Techniques. Wiley.
[174]
Javier Poncela and M. C. Aguayo-Torres. 2013. A framework for testing of wireless underwater robots. Wirel. Person. Commun. 70, 3 (2013), 1171–1181.
[175]
David Porfirio, Allison Sauppé, Aws Albarghouthi, and Bilge Mutlu. 2018. Authoring and verifying human-robot interactions. In 31st Annual ACM Symposium on User Interface Software and Technology. ACM, 75–86.
[176]
Martin Proetzsch, Fabian Zimmermann, Robert Eschbach, Johannes Kloos, and Karsten Berns. 2010. A systematic testing approach for autonomous mobile robots using domain-specific languages. In Annual Conference on Artificial Intelligence. Springer, 317–324.
[177]
Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. 2009. ROS: An open-source Robot Operating System. In ICRA Workshop on Open Source Software.
[178]
Nijat Rajabli, Francesco Flammini, Roberto Nardone, and Valeria Vittorini. 2021. Software verification and validation of safe autonomous cars: A systematic literature review. IEEE Access 9 (2021), 4797–4819.
[179]
Arvind Ramanathan, Laura L. Pullum, Faraz Hussain, Dwaipayan Chakrabarty, and Sumit Kumar Jha. 2016. Integrating symbolic and statistical methods for testing intelligent systems: Applications to machine learning and computer vision. In Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 786–791.
[180]
Q. Rao and J. Frtunikj. 2018. Deep learning for self-driving cars: Chances and challenges. In IEEE/ACM 1st International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS). IEEE, 35–38.
[181]
Signe A. Redfield and Mae L. Seto. 2017. Verification challenges for autonomous systems. In Autonomy and Artificial Intelligence: A Threat or Savior?, William F. Lawless, Ranjeev Mittu, Donald Sofge, and Stephen Russell (Eds.). Springer International Publishing, 103–127.
[182]
Pedro Ribeiro, Alvaro Miyazawa, Wei Li, Ana Cavalcanti, and Jon Timmis. 2017. Modelling and verification of timed robotic controllers. In International Conference on Integrated Formal Methods. Springer, 18–33.
[183]
Sergio Rico, Emelie Engström, and Martin Höst. 2019. A taxonomy for improving industry-academia communication in IoT vulnerability management. In 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 38–45.
[184]
Eric Rohmer, Surya P. N. Singh, and Marc Freese. 2013. V-REP: A versatile and scalable robot simulation framework. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1321–1326.
[185]
Martijn Rooker, Pablo Horstrand, Aythami Salvador Rodriguez, Sebastian Lopez, Roberto Sarmiento, Jose Lopez, Ray Alejandro Lattarulo, Joshue Manuel Perez Rastelli, Zora Slavik, David Pereira, et al. 2018. Towards improved validation of autonomous systems for smart farming. In Smart Farming Workshop. ISEP.
[186]
Gregg Rothermel, Roland H. Untch, Chengyun Chu, and Mary Jean Harrold. 1999. Test case prioritization: An empirical study. In International Conference on Software Maintenance. IEEE Computer Society, 179–188.
[187]
Sasha Rubin. 2015. Parameterised verification of autonomous mobile-agents in static but unknown environments. In International Conference on Autonomous Agents and Multiagent Systems. ACM, 199–208.
[188]
Peter A. M. Ruijten, Antal Haans, Jaap Ham, and Cees J. H. Midden. 2019. Perceived human-likeness of social robots: Testing the Rasch model as a method for measuring anthropomorphism. Int. J. Soc. Robot. 11, 3 (2019), 477–494.
[189]
Rim Saddem, Olivier Naud, Karen Godary Dejean, and Didier Crestani. 2017. Decomposing the model-checking of mobile robotics actions on a grid. IFAC-PapersOnLine 50, 1 (2017), 11156–11162.
[190]
Francesca Saglietti and Matthias Meitner. 2016. Model-driven structural and statistical testing of robot cooperation and reconfiguration. In 3rd Workshop on Model-driven Robot Software Engineering. ACM, 17–23.
[191]
Francesca Saglietti, Stefan Winzinger, and Raimar Lill. 2014. Reconfiguration testing for cooperating autonomous agents. In International Conference on Computer Safety, Reliability, and Security. Springer, 144–155.
[192]
André Santos, Alcino Cunha, and Nuno Macedo. 2018. Property-based testing for the robot operating system. In 9th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation. ACM, 56–62.
[193]
Ichiro Satoh. 2018. An approach for testing software on networked transport robots. In 14th IEEE International Workshop on Factory Communication Systems (WFCS). IEEE, 1–4.
[194]
Ichiro Satoh. 2019. Developing and testing networked software for moving robots. In 14th International Conference on Evaluation of Novel Approaches to Software Engineering. Springer, 315–321.
[195]
Hans-Peter Schöner. 2018. Simulation in development and testing of autonomous vehicles. In Internationales Stuttgarter Symposium. Springer, 1083–1095.
[196]
David Seiferth and Matthias Heller. 2017. Testing and performance enhancement of a model-based designed ground controller for a diamond-shaped unmanned air vehicle (UAV). In IEEE Conference on Control Technology and Applications (CCTA). IEEE, 1988–1994.
[197]
Yuvaraj Selvaraj, Wolfgang Ahrendt, and Martin Fabian. 2019. Verification of decision making software in an autonomous vehicle: An industrial case study. In International Workshop on Formal Methods for Industrial Critical Systems. Springer, 143–159.
[198]
Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2018. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics. Springer, Zurich, Switzerland, 621–635.
[199]
Weijing Shi, Mohamed Baker Alawieh, Xin Li, Huafeng Yu, Nikos Arechiga, and Nobuyuki Tomatsu. 2016. Efficient statistical validation of machine learning systems for autonomous driving. In 35th International Conference on Computer-aided Design. ACM, 1–8.
[200]
Christoph Sippl, Florian Bock, David Wittmann, Harald Altinger, and Reinhard German. 2016. From simulation data to test cases for fully automated driving and ADAS. In IFIP International Conference on Testing Software and Systems. Springer, 191–206.
[201]
Gopinadh Sirigineedi, Antonios Tsourdos, Brian A. White, and Rafał Żbikowski. 2011. Kripke modelling and verification of temporal specifications of a multiple UAV system. Ann. Math. Artif. Intell. 63, 1 (2011), 31–52.
[202]
Michał Siwek, Leszek Baranowski, Jarosław Panasiuk, and Wojciech Kaczmarek. 2019. Modeling and simulation of movement of dispersed group of mobile robots using Simscape multibody software. In AIP Conference Proceedings. AIP Publishing LLC, 020045.
[203]
Marc Spislaender and Francesca Saglietti. 2018. Evidence-based verification of safety properties concerning the cooperation of autonomous agents. In 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 81–88.
[204]
Tomoo Sumida, Hiroyuki Suzuki, Sho Sei Shun, Kazuhito Omaki, Takaaki Goto, and Kensei Tsuchida. 2017. FDR verification of a system involving a robot climbing stairs. In IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS). IEEE, 875–878.
[205]
Xiaowu Sun, Haitham Khedr, and Yasser Shoukry. 2019. Formal verification of neural network controlled autonomous systems. In 22nd ACM International Conference on Hybrid Systems: Computation and Control. ACM, 147–156.
[206]
Zsolt Szalay, Mátyás Szalai, Bálint Tóth, Tamás Tettamanti, and Viktor Tihanyi. 2019. Proof of concept for Scenario-in-the-Loop (SciL) testing for autonomous vehicle technology. In IEEE International Conference on Connected Vehicles and Expo (ICCVE). IEEE, 1–5.
[207]
Zaid Tahir and Rob Alexander. 2020. Coverage based testing for V&V and safety assurance of self-driving autonomous vehicles: A systematic literature review. In IEEE International Conference on Artificial Intelligence Testing (AITest). IEEE, 23–30.
[208]
Jianbo Tao, Yihao Li, Franz Wotawa, Hermann Felbinger, and Mihai Nica. 2019. On the industrial application of combinatorial testing for autonomous driving functions. In IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 234–240.
[209]
Mugur Tatar. 2015. Enhancing ADAS test and validation with automated search for critical situations. In Driving Simulation Conference (DSC). DSC Europe.
[210]
Unity Technologies. 2021. Unity. Retrieved from https://unity.com.
[211]
Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In 40th International Conference on Software Engineering. ACM, 303–314.
[212]
Thomas Tosik, Jasper Schwinghammer, Mandy Jane Feldvoß, John Paul Jonte, Arne Brech, and Erik Maehle. 2016. MARS: A simulation environment for marine swarm robotics and environmental monitoring. In OCEANS Conference. IEEE, 1–6.
[213]
Tarik Tosun, Gangyuan Jing, Hadas Kress-Gazit, and Mark Yim. 2018. Computer-aided compositional design and verification for modular robots. Robot. Res. 1 (2018), 237–252.
[214]
Garazi Juez Uriagereka, Estibaliz Amparan, Cristina Martinez Martinez, Jabier Martinez, Aurelien Ibanez, Matteo Morelli, Ansgar Radermacher, and Huascar Espinoza. 2019. Design-time safety assessment of robotic systems using fault injection simulation in a model-driven approach. In ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C). IEEE, 577–586.
[215]
Vandi Verma and Chris Leger. 2019. SSim: NASA Mars Rover robotics flight software simulation. In IEEE Aerospace Conference. IEEE, 1–11.
[216]
Federico Vicentini, Mehrnoosh Askarpour, Matteo G. Rossi, and Dino Mandrioli. 2019. Safety assessment of collaborative robotics through automated formal verification. IEEE Trans. Robot. 36, 1 (2019), 42–61.
[217]
Harsha Jakkanahalli Vishnukumar, Björn Butting, Christian Müller, and Eric Sax. 2017. Machine learning and deep neural network–Artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation. In Intelligent Systems Conference (IntelliSys). IEEE, 714–721.
[218]
Dennis Walter, Holger Täubig, and Christoph Lüth. 2010. Experiences in applying formal verification in robotics. In International Conference on Computer Safety, Reliability, and Security. Springer, 347–360.
[219]
Kai Wang and J. C. Cheng. 2019. Integrating hardware-in-the-loop simulation and BIM for planning UAV-based As-built MEP inspection with deep learning techniques. In 36th International Symposium on Automation and Robotics in Construction. IAARC, 310–316.
[220]
Rui Wang, Yingxia Wei, Houbing Song, Yu Jiang, Yong Guan, Xiaoyu Song, and Xiaojuan Li. 2018. From offline towards real-time verification for robot systems. IEEE Trans. Industr. Inform. 14, 4 (2018), 1712–1721.
[221]
Matt Webster, Clare Dixon, Michael Fisher, Maha Salem, Joe Saunders, Kheng Lee Koay, Kerstin Dautenhahn, and Joan Saez-Pons. 2015. Toward reliable autonomous robotic assistants through formal verification: A case study. IEEE Trans. Hum.-mach. Syst. 46, 2 (2015), 186–196.
[222]
Matt Webster, Maha Salem, Clare Dixon, Michael Fisher, and Kerstin Dautenhahn. 2014. Formal verification of an autonomous personal robotic assistant. In AAAI Spring Symposium.
[223]
Matt Webster, David Western, Dejanira Araiza-Illan, Clare Dixon, Kerstin Eder, Michael Fisher, and Anthony G. Pipe. 2019. A corroborative approach to verification and validation of human–robot teams. Int. J. Robot. Res. 39, 1 (2019), 73–99.
[224]
Dennis Leroy Wigand, Pouya Mohammadi, Enrico Mingo Hoffman, Nikos G. Tsagarakis, Jochen J. Steil, and Sebastian Wrede. 2018. An open-source architecture for simulation, execution and analysis of real-time robotics systems. In IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR). IEEE, 93–100.
[225]
Tichakorn Wongpiromsarn, Sayan Mitra, Andrew Lamperski, and Richard M. Murray. 2012. Verification of periodically controlled hybrid systems: Application to an autonomous vehicle. ACM Trans. Embed. Comput. Syst. 11, S2 (2012), 1–24.
[226]
Bingqing Xu, Qin Li, Tong Guo, Yi Ao, and Dehui Du. 2019. A quantitative safety verification approach for the decision-making process of autonomous driving. In International Symposium on Theoretical Aspects of Software Engineering (TASE). IEEE, 128–135.
[227]
Bingqing Xu, Qin Li, Tong Guo, and Dehui Du. 2019. A scenario-based approach for formal modelling and verification of safety properties in automated driving. IEEE Access 7 (2019), 140566–140587.
[228]
Wing Lok Yeung. 2011. Behavioral modeling and verification of multi-agent systems for manufacturing control. Expert Syst. Appl. 38, 11 (2011), 13555–13562.
[229]
Levent Yilmaz. 2017. Verification and validation of ethical decision-making in autonomous systems. In Symposium on Modeling and Simulation of Complexity in Intelligent, Adaptive and Autonomous Systems. Springer, 1–12.
[230]
Yujian Fu and Mebougna Drabo. 2014. Formal modeling and verification of dynamic reconfiguration of autonomous robotics systems. In International Conference on Embedded Systems and Applications (ESA). CSREA Press, 14.
[231]
Sunkil Yun, Takaaki Teshima, and Hidekazu Nishimura. 2019. Human–machine interface design and verification for an automated driving system using system model and driving simulator. IEEE Consum. Electron. Mag. 8, 5 (2019), 92–98.
[232]
Chi Zhang, Yuehu Liu, Danchen Zhao, and Yuanqi Su. 2014. RoadView: A traffic scene simulator for autonomous vehicle simulation testing. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 1160–1165.
[233]
Xiaoyang Zhang, Hongpeng Wang, Jingtai Liu, and Haifeng Li. 2019. CyberEarth: A virtual simulation platform for robotics and cyber-physical systems. In IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 858–863.
[234]
Xingyu Zhao, Matt Osborne, Jenny Lantair, Valentin Robu, David Flynn, Xiaowei Huang, Michael Fisher, Fabio Papacchini, and Angelo Ferrando. 2019. Towards integrating formal verification of autonomous robots with battery prognostics and health management. In International Conference on Software Engineering and Formal Methods. Springer, 105–124.
[235]
Xingyu Zhao, Valentin Robu, David Flynn, Fateme Dinmohammadi, Michael Fisher, and Matt Webster. 2019. Probabilistic model checking of robots deployed in extreme environments. In AAAI Conference on Artificial Intelligence. AAAI, 8066–8074.
[236]
Xingyu Zhao, Valentin Robu, David Flynn, Kizito Salako, and Lorenzo Strigini. 2019. Assessing the safety and reliability of autonomous vehicles from road testing. In IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 13–23.
[237]
Jinwei Zhou, Roman Schmied, Alexander Sandalek, Helmut Kokal, and Luigi del Re. 2016. A framework for virtual testing of ADAS. SAE Int. J. Passeng. Cars-Electron. Electric. Syst. 9, 2016-01-0049 (2016), 66–73.
[238]
Marc René Zofka, Marc Essinger, Tobias Fleck, Ralf Kohlhaas, and J. Marius Zöllner. 2018. The sleepwalker framework: Verification and validation of autonomous vehicles by mixed reality lidar stimulation. In IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR). IEEE, 151–157.
