
A Survey on Automated Driving System Testing: Landscapes and Trends

Published: 24 July 2023

Abstract

Automated Driving Systems (ADS) have made great achievements in recent years thanks to the efforts from both academia and industry. A typical ADS is composed of multiple modules, including sensing, perception, planning, and control, which brings together the latest advances in different domains. Despite these achievements, safety assurance of ADS is of great significance, since unsafe behavior of ADS can bring catastrophic consequences. Testing has been recognized as an important system validation approach that aims to expose unsafe system behavior; however, in the context of ADS, it is extremely challenging to devise effective testing techniques, due to the high complexity and multidisciplinarity of the systems. There is a large body of literature that focuses on the testing of ADS, and a number of surveys have also emerged to summarize the technical advances. Most of the surveys focus on the system-level testing performed within software simulators, and they thereby ignore the distinct features of different modules. In this article, we provide a comprehensive survey on the existing ADS testing literature, which takes into account both module-level and system-level testing. Specifically, we make the following contributions: (1) We survey the module-level testing techniques for ADS and highlight the technical differences affected by the features of different modules; (2) we also survey the system-level testing techniques, with a focus on the empirical studies that summarize the issues occurring in system development or deployment, the problems due to the collaborations between different modules, and the gap between ADS testing in simulators and the real world; and (3) we identify the challenges and opportunities in ADS testing, which pave the way for future research in this field.

1 Introduction

With the aim of bringing a convenient driving experience, increasing driving safety, and reducing traffic congestion, automated driving systems (ADS) (a.k.a. self-driving cars) have attracted significant attention from both academia and industry. According to the statistics from a recent report [1], the autonomous car market was valued at more than 22 billion dollars in 2021. However, the state-of-the-practice ADS are still vulnerable to numerous safety and security threats, due to either complicated external environments or deliberate attacks from various sources. These threats may lead to system failure, which could bring catastrophic consequences and unacceptable losses [2]. Despite the rapid progress that has been made so far, safety assurance of ADS is still a major challenge to their full-scale industrialization. Some recent news, e.g., the report of Tesla’s fatal accident [3], further highlights the importance of research in the safety assurance of automated driving.
In general, an ADS is composed of several modules for the functionalities of sensing, perception, planning, and control. The sensing module collects and preprocesses the environmental data using a number of intelligent sensors, such as camera, radio detection and ranging (radar), and light detection and ranging (LiDAR). The perception module extracts information from the sensors to understand the environmental conditions, including road conditions, obstacles, and traffic signs. Based on the output of the perception module, the planning module generates the optimal driving trajectories that are expected to be followed by the ADS. Last, the control module sends the lateral and longitudinal control signals to drive the ADS along the planned trajectories. In particular, some ADS adopt a special end-to-end design that integrates the perception, planning, and control functionalities in a single module. These modules collaborate with each other and jointly decide the behavior of the ADS [4]; the abnormal function of any module can lead to system failures, which severely threatens the safety and security of the ADS.
Testing has been an effective approach to exposing potential problems and ensuring the safety of systems. However, the testing of ADS is known to be extremely challenging, due to the complexity and multidisciplinarity of those systems. In recent years, there has been a surge of studies that focus on ADS testing. These papers span mainstream venues in various domains, such as transportation venues (e.g., International Conference on Intelligent Transportation Systems (ITSC) and IEEE Intelligent Vehicles (IV)), software engineering venues (e.g., International Conference on Software Engineering (ICSE) and International Conference on Automated Software Engineering (ASE)), artificial intelligence venues (e.g., Computer Vision and Pattern Recognition Conference (CVPR) and AAAI Conference on Artificial Intelligence (AAAI)), and security venues (e.g., ACM Conference on Computer and Communications Security (CCS) and USENIX Security Symposium), which tackle the challenges in ADS testing from various perspectives (see the detailed statistics and analysis in Section 3.2). Numerous testing approaches have been proposed for solving different problems, and numerous bugs and vulnerabilities have been reported to facilitate the system reengineering that repairs the existing problems and ensures the system safety.
To better understand the landscape of ADS testing, there have been a number of surveys [5, 6, 7, 8, 9] that summarize the recent advances in this field. Grigorescu et al. [5] investigate the deep learning techniques for different modules of ADS and discuss the safety risks of these techniques. Rosique et al. [6] analyze the characteristics of the common sensors used for perception, as well as the performance of different simulators for the simulation of perception systems. Zhang et al. [7] present a literature review on the techniques for identification of critical scenarios, in which they point out the necessity of combining different scenario identification methods for safety assurance of ADS. Zhong et al. [8] review the works about scenario-based testing in high-fidelity simulators, and discuss the gap between the virtual environment and the real-world environment. Jahangirova et al. [9] propose a set of driving quality metrics and oracles for ADS testing and demonstrate the effectiveness of combining the 26 best metrics as the functional oracles.
Most of the existing surveys view the ADS under study as a whole and investigate the methodologies of ADS testing from the perspective of the system level. In that case, as a typical problem setting, ADS testing consists in generating critical scenarios that can lead to system failures, such as collisions with obstacles. In addition, because of the high cost of testing ADS in the real world, most of the studies in these surveys adopt software simulators as the testing environments. While these surveys are useful, they are not sufficient to show the comprehensive landscape of ADS testing. Indeed, since ADS are complex and composed of multiple modules that differ from each other in technical design, their testing should capture the features of different modules and address the challenges in different domains. Moreover, at the system level, the testing should concern the problems arising from the collaborations between different modules and highlight the gap between simulation-based testing and real-world testing.
Contributions. To bridge this gap, we conduct a survey on ADS testing that focuses on both module-level testing and system-level testing. Specifically, at the module level, we reveal the distinction of the testing techniques for different modules due to their different features; at the system level, we focus on the challenges introduced by the cooperation between different modules and also discuss the different levels of realism of the testing environments. In particular, we answer the following research questions in this survey:
RQ1: What are the techniques adopted for testing different modules of an ADS?
RQ2: What are the techniques adopted for system-level testing of ADS?
RQ3: What are the challenges and opportunities in the field of ADS testing?
To answer these questions, we make the following contributions in this article:
To answer RQ1, we survey the testing techniques for the different modules of ADS, and in particular, we highlight the technical differences in these testing techniques due to the characteristics of different modules;
To answer RQ2, we survey the system-level testing techniques, with a focus on the following:
First, we review the existing empirical studies on the issues/bugs in public reports and repositories. These studies reveal the system issues occurring in their development or deployment and show a bird’s-eye view on the potential system problems without running them;
Second, we study the existing investigations on the safety problems at the system level, when different modules collaborate and interact with each other during the running of the systems;
Third, we focus on the gap between simulation-based testing and real-world testing, which is an emerging topic of great importance, to understand the quality of testing.
To answer RQ3, we identify the challenges and potential research opportunities for ADS testing, based on our survey results.
To the best of our knowledge, our work is the first one that unveils the intrinsic differences and challenges in ADS testing w.r.t. different modules; meanwhile, we give a specific emphasis on the comparison between the currently popular simulation-based testing and real-world testing. Moreover, our analysis and discussion on the challenges and opportunities exhibit the landscapes and stimulate future research in this important field.
Paper organization. The rest of the article is organized as follows: Section 2 overviews the background of the ADS; Section 3 describes the survey methodology, including the detailed scope, collection process, and collection results. The main results of this survey are in Sections 4, 5, and 6. In Section 4, we survey the literature of empirical study on ADS testing; in Section 5, we survey the literature of techniques on module-level ADS testing and answer RQ1; and in Section 6, we survey the literature of techniques on system-level ADS testing and answer RQ2. We then show the statistics and analysis of the works in Section 7. We summarize the challenges and potential research directions in Section 8 and answer RQ3. Last, we conclude this survey in Section 9.

2 Preliminaries

Nowadays, autonomous systems have been deployed in various application domains, such as transportation, robotics, and healthcare, and they have made huge differences to our daily lives. In this work, we pay particular attention to ADS, i.e., self-driving cars, as a typical example to exemplify the concerns in the quality aspects of those systems. In this section, we first provide an overview of the categorization of ADS according to the levels of automation, from L0 to L5; then we show the general architecture of ADS; last, we showcase four open source ADS and one simulation platform.

2.1 Overview of Automated Driving Systems

To account for the complexity and variety of ADS, the Society of Automotive Engineers (SAE) proposed the taxonomy and definitions of driving automation systems, known as SAE J3016,1 which has become a classification standard in recent years. This standard categorizes driving automation systems into six levels, ranging from no driving automation (Level 0) to full driving automation (Level 5). These levels are usually referred to as L0 to L5.
The definitions of the systems from L0 to L5 are as follows:
(1)
L0 systems only perform warnings and momentary interventions, such as Lane Departure Warning (LDW) and Automated Emergency Braking (AEB), and the drivers need to perform all of the dynamic driving tasks (DDT);
(2)
L1 systems support steering or acceleration/deceleration for drivers; example features include Automated Lane Centering (ALC) and Adaptive Cruise Control (ACC);
(3)
L2 systems perform steering and acceleration/deceleration at the same time, and a typical L2 system should support both ALC and ACC;
(4)
L3 systems can execute responses to driving conditions within an Operational Design Domain (ODD), which is an operational restriction imposed on the ADS at the design stage, but these systems require fallback-ready people to handle system failures; an example is a traffic jam chauffeur;
(5)
L4 systems can further support the system fallback, and an example is a local driverless taxi;
(6)
L5 systems can handle all driving conditions.
These are shown in Table 1.
Table 1.
Level | Name | Description | Example
0 | No Driving Automation | Drivers perform all of the DDT | LDW
1 | Driver Assistance | The system performs part of the DDT: either steering or acceleration/deceleration | ALC or ACC
2 | Partial Driving Automation | The system performs part of the DDT: steering and acceleration/deceleration | ALC and ACC
3 | Conditional Driving Automation | Drivers or fallback-ready users need to be receptive to ADS-issued requests | Traffic jam chauffeur
4 | High Driving Automation | The system performs all of the DDT and DDT fallback within a specified ODD | Local driverless taxis
5 | Full Driving Automation | The system performs all of the DDT and DDT fallback without ODD limitation | Fully autonomous vehicles
Table 1. Automation Levels and Definitions by SAE
A system in L0 to L2 is also known as an advanced driver assistance system (ADAS), since it is only in charge of a part of the DDT, such as lateral control or longitudinal control, and the safety of the whole vehicle still relies on drivers. In contrast, a system in L3–L5 performs all of the DDT and drivers are not expected to interfere during the driving process, thus realizing truly automated driving.
Note that there exist other synonyms of automated driving, e.g., autonomous driving and self-driving, but in this article, we follow the relevant terminology from SAE J3016, in which the term “ADS” refers to Automated Driving System. In the literature [8], an ADAS usually refers to a system that belongs to L0–L2, while an ADS refers to a system that belongs to L3–L5. In this survey, since many testing techniques are independent of the automation levels of the systems under test, we sometimes use the terms interchangeably and, by ADS, refer to systems across all levels of driving automation.

2.2 Architecture of ADS

A common ADS is composed of four functional modules, namely the sensing module, the perception module, the planning module, and the control module, as shown in Figure 1. In the sensing module, intelligent sensors (e.g., camera, radar, and LiDAR) are used to collect the driving context from the physical world. The perception module extracts useful environmental information from the sensor data and sends it to the planning module for motion planning. Based on the information, the planning module generates the optimal driving trajectory. Last, the control module outputs the control commands to drive the vehicle along the trajectory. Moreover, some modern ADS adopt a special design named end-to-end module. In the remainder of this section, we elaborate on the functionalities of these modules in the typical architecture of an ADS.
Sensing module. By adopting various physical sensors, the sensing module takes charge of collecting and preprocessing driving environmental information from the physical world. The common sensors used by an ADS include Global Positioning Systems (GPS), inertial measurement units (IMU), cameras, radar, and LiDAR. Specifically, GPS provides the absolute position data (e.g., latitude, longitude, and heading angle) while IMU provides the temporal data (e.g., acceleration and angular velocity). The combination of these two sensors can provide more accurate real-time positioning of the autonomous vehicles. Cameras are used to record and capture visual information on the driving road for the perception module, and radar is used to detect obstacles by radio waves. Nowadays, LiDAR has become an indispensable component in many leading ADS (e.g., Apollo and Autoware), since it can collect three-dimensional (3D) point cloud data and process it with higher measurement accuracy. In comparison with camera sensors that are sensitive to light conditions (e.g., shadows and bright sunlight), LiDAR sensors are more robust under these environments, and the generated 3D point cloud can be further utilized to build 3D models that better characterize the surrounding objects.
Fig. 1.
Fig. 1. The typical architecture of an ADS.
Perception module. With the help of deep learning techniques, the perception module processes sensor data (e.g., pictures and 3D point cloud) from the sensing module to accomplish a series of perception tasks, such as localization, detection, and prediction.
Localization provides the real-time location of the ADS during the driving process. In practice, localization is mostly implemented by fusing the data from GPS, IMU, and LiDAR. Specifically, the 3D point cloud data of LiDAR are used to match the features stored in a High-Definition (HD) map to determine the most likely location.
Detection includes lane detection, traffic light detection, and object detection. Camera data are often used for lane detection and traffic light detection, while the data from camera, radar, and LiDAR are often fused by algorithms such as extended Kalman filters [10] for object detection (a simplified fusion sketch is given after the prediction note below). These detection tasks are mostly implemented by using deep neural networks (DNNs), such as Faster R-CNN [11] and YOLOv3 [12].
Prediction also builds on the output of the perception module and is mainly used for trajectory planning; we introduce this task below as part of the planning module.
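As a concrete illustration of the sensor fusion step mentioned above, the following sketch fuses two noisy position measurements (e.g., one from radar and one from LiDAR) with a simplified one-dimensional linear Kalman measurement update. Real ADS stacks use extended Kalman filters over full object states and include a prediction step, so this is only a minimal, assumption-laden example with made-up numbers.

```python
import numpy as np

def kalman_fuse(x_est, p_est, z, r):
    """One linear Kalman measurement update: fuse estimate (x_est, variance p_est)
    with a new measurement z of variance r. The prediction step is omitted here."""
    k = p_est / (p_est + r)           # Kalman gain
    x_new = x_est + k * (z - x_est)   # corrected estimate
    p_new = (1.0 - k) * p_est         # reduced uncertainty
    return x_new, p_new

# Hypothetical example: prior position of a detected object, then two sensor readings.
x, p = 10.0, 4.0                          # prior: 10 m ahead, variance 4 m^2
x, p = kalman_fuse(x, p, z=11.2, r=2.0)   # radar measurement (noisier)
x, p = kalman_fuse(x, p, z=10.6, r=0.5)   # LiDAR measurement (more precise)
print(f"fused position: {x:.2f} m, variance: {p:.2f}")
```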
Planning module. By using DNNs and planning algorithms, the planning module takes perception data as input and makes decisions for the control module to control the vehicle. It has two submodules, namely the prediction submodule and the planning submodule.
The prediction submodule estimates the future trajectories of the moving objects (e.g., vehicles and pedestrians) detected by the perception module. For a given moving object, the likelihood of its future path is often estimated by machine learning (ML) models, e.g., LSTMs and RNNs.
The planning submodule generates the optimal driving trajectory for the ego vehicle based on the prediction results. Specifically, this module is responsible for three tasks, namely route planning, behavior planning, and motion planning.
Route planning selects the optimal path for the vehicle by using path algorithms, such as Dijkstra and A* (a minimal sketch is given after this list);
Behavior planning makes decisions for the actions taken by the ADS, such as lane changing and car following, based on the system requirements and traffic rules;
Motion planning generates velocity and steering angle plans that are locally optimal by considering several factors, including safety, efficiency, and comfort.
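To make the route planning step above concrete, the sketch below runs Dijkstra's algorithm on a toy road graph; the intersections, edge weights, and cost interpretation are purely hypothetical.

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path on a weighted road graph given as {node: [(neighbor, cost), ...]}."""
    dist = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor], prev[neighbor] = nd, node
                heapq.heappush(queue, (nd, neighbor))
    # Reconstruct the route by walking back through the predecessors.
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[target]

# Hypothetical intersections A-D with travel costs (e.g., expected time in seconds).
road_graph = {"A": [("B", 4), ("C", 2)], "B": [("D", 5)], "C": [("B", 1), ("D", 8)], "D": []}
print(dijkstra(road_graph, "A", "D"))  # (['A', 'C', 'B', 'D'], 8.0)
```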
Control module. Based on the trajectories planned by the planning module, the control module finally takes charge of the longitudinal and lateral control of the vehicle. By using control algorithms (e.g., proportional integral derivative (PID) control [13] and model predictive control (MPC) [14]), this module generates appropriate control commands (e.g., steering and braking) and sends them to the related hardware, i.e., the electronic control unit (ECU), via the protocol of controller area network (CAN) bus. Note that this module is critical for several functionalities provided by the ADS, including ACC, AEB, and Lane Keeping Assistance.
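As a minimal illustration of the longitudinal control loop described above, the following is a textbook PID speed controller; the gains and the toy plant model are hypothetical, and production controllers typically combine such loops with MPC and detailed vehicle dynamics models.

```python
class PID:
    """Textbook PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical longitudinal control: track a 15 m/s target speed with a crude plant model.
pid, speed, dt = PID(kp=0.8, ki=0.1, kd=0.05), 0.0, 0.1
for _ in range(50):
    throttle = pid.step(15.0 - speed, dt)   # control command
    speed += 0.5 * throttle * dt            # toy vehicle response to the command
print(f"speed after 5 s: {speed:.1f} m/s")
```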
End-to-end module. As shown in Figure 1, besides the common modules mentioned above, there exists another end-to-end design that combines the perception, planning, and control processes in one module. To be specific, this module mainly consists of special deep learning models, which are trained on labeled data that maps sensor information directly to the corresponding control commands. Consequently, these models could output the control commands based on the current driving environment.
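The end-to-end design can be illustrated with a small convolutional network that maps a camera frame directly to a steering command, in the spirit of behavior-cloning models; the architecture, layer sizes, and input resolution below are illustrative assumptions, not a model from any surveyed system.

```python
import torch
import torch.nn as nn

class TinyEndToEnd(nn.Module):
    """Illustrative end-to-end model: camera image -> steering command."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 4 * 4, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, image):
        return self.head(self.features(image))

model = TinyEndToEnd()
frame = torch.rand(1, 3, 120, 160)   # one hypothetical RGB camera frame
steering = model(frame)              # training would regress this against recorded driver commands
print(steering.shape)                # torch.Size([1, 1])
```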

2.3 Open Source Systems and Tool Stacks

In this section, we introduce four open source ADS, namely Apollo, Autoware, OpenPilot, and Pylot, and one simulation platform called BeamNG [15]. The first three ADS have been widely adopted for commercial usage in practice [16, 17, 18], and the last ADS is from academia.
Apollo. Apollo has been a popular open source ADS developed by Baidu since 2017; as of December 2021, it has been updated to version 7.0.0. The hardware platform of Apollo includes camera, LiDAR, millimeter wave radar, and Human-Machine Interface (HMI) device, and currently the communications over different components are managed by CyberRT [19]. The functionalities of Apollo include cruising, urban obstacle avoidance, and lane changing.
Autoware. Autoware is another open source L4 ADS developed by the research group of Nagoya University in 2015. Though it is mainly applicable to urban roads, it also suits highways and other road conditions. By using the sensors introduced in Section 2.2, it supports a series of functionalities including connected navigation, traffic light recognition, and object tracking. Unlike Apollo, which uses CyberRT, Autoware adopts robot operating system (ROS) [20] for communications over different components.
OpenPilot. OpenPilot is a popular open source L2 ADAS developed by Comma.ai, and it has been updated to version 0.8.12 as of December 2021. OpenPilot supports common L2 features, such as ACC, ALC, and Forward Collision Warning. Unlike other L2 ADAS, OpenPilot has a high degree of portability—it can be compatible with more than 120 types of vehicle models by using related hardware set (e.g., Car Harness [21] and Comma Two [22]).
Pylot. Pylot [23] is a modular and open source autonomous driving platform developed by UC Berkeley in 2021. For achieving the tradeoff between latency and accuracy, it is built on a deterministic dataflow system called ERDOS [24]. Pylot also has other built-in features such as modularity, portability, and debuggability, which allow researchers to implement or test ADS functions with higher efficiency.
BeamNG. BeamNG [25] is a popular image-generating simulation platform, which has been widely used in the Search-Based Software Testing competition [26]. Specifically, it is based on a physically accurate engine that can support customized vehicle models and realistic damage. For example, different components of a vehicle can have different degrees of deformation after a collision. In addition, BeamNG also contains a driving agent called BeamNG.AI [27], which could take over one or more vehicles and drive in several different modes.

3 Paper Collection Methodology and Results

In this section, we introduce our paper collection methodology in Section 3.1 and present the statistics and analysis of the results in Section 3.2.

3.1 Paper Collection Methodology

This section introduces the methodology adopted in our paper collection process, which is illustrated in Figure 2. The overall process consists of five main steps, namely database search, abstract analysis, full-text analysis, backward & forward snowballing, and data extraction. The intermediate results of each step during the process are reported in our supplementary website2 and are also available in Zenodo [28]. We now describe the details of each step as follows.
Fig. 2.
Fig. 2. Overview of the paper collection methodology.

3.1.1 Database Search.

This step aims to find the potentially relevant papers by searching in electronic databases. Specifically, we select DBLP3 as our database, which is a popular bibliography database containing a comprehensive list of research venues in computer science. Moreover, our search targets the titles of the papers, since the title often conveys the theme of a paper. We optimize the search string in an iterative manner to collect as many related papers as possible. The final search string used during our search process is shown as follows:
((“automated vehicle” OR “automated driving” OR “autonomous car” OR “autonomous vehicle” OR “autonomous driving” OR “self-driving” OR “driver assistance system” OR “intelligent system” OR “intelligent vehicle” OR “intelligent agent”)
AND
(“test” OR “attack” OR “validation” OR “evaluation” OR “quality assurance” OR “quality assessment” OR “oracle” OR “mutation” OR “fuzzing”))
The first group of terms (above “AND”) represents the identified synonyms of automated driving, which contains terms such as “autonomous driving” and “self-driving”; the second group of terms (below “AND”) contains the common phases in the process of quality assessment of software systems (e.g., “test” and “validation”), along with popular testing approaches (e.g., “mutation” and “fuzzing”) and the keyword “oracle,” which is a significant concept in software testing. The terms in each group are connected with the OR operator, while the two groups are connected with the AND operator, which means that a relevant paper should cover the characteristics of both groups. Overall, the application of the above search string on DBLP retrieves 1,185 papers. After removing 144 duplicates, the final number of papers we collected is 1,041.
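For illustration only, the Boolean structure of this query (a title must contain a term from both groups) can be expressed as a simple filter; the actual search was performed through DBLP's own title search interface, not with this code.

```python
# Illustrative only: mirrors the AND-of-two-ORs structure of the search string above.
DRIVING_TERMS = ["automated vehicle", "automated driving", "autonomous car",
                 "autonomous vehicle", "autonomous driving", "self-driving",
                 "driver assistance system", "intelligent system",
                 "intelligent vehicle", "intelligent agent"]
TESTING_TERMS = ["test", "attack", "validation", "evaluation", "quality assurance",
                 "quality assessment", "oracle", "mutation", "fuzzing"]

def matches_query(title: str) -> bool:
    """A title is kept only if it contains a term from BOTH groups."""
    t = title.lower()
    return any(term in t for term in DRIVING_TERMS) and any(term in t for term in TESTING_TERMS)

print(matches_query("Fuzzing the Planner of an Autonomous Driving System"))  # True
print(matches_query("A Survey on Deep Learning"))                            # False
```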

3.1.2 Abstract Analysis.

To determine whether each primary candidate paper is relevant to ADS testing, we perform a manual analysis on the abstracts of 1,041 papers obtained from database search in Section 3.1.1. This process is conducted by two assessors, i.e., the first two authors, following the inclusion and exclusion criteria formulated as follows:
Inclusion Criteria:
IC1
papers that propose a method for testing the modules of ADS or the whole system;
IC2
papers that introduce metrics as test oracles or adequacy criteria for testing the modules of ADS or the whole system;
IC3
papers published between January 2015 and June 2022.
Exclusion Criteria:
EC1
preprint papers or non-peer-reviewed papers;
EC2
early results or preliminary studies;
EC3
papers that do not target ADS;
EC4
survey papers or summary papers;
EC5
papers that do not focus on assessing quality aspects of ADS or its components;
EC6
papers that focus on other quality aspects such as HMI, cyber security, and adversarial defense.
Specifically, for IC2, test oracle refers to the metrics that measure whether an ADS or its components misbehave, and test adequacy refers to the criteria that judge whether a test suite has been sufficient for testing; for EC3, papers related to other intelligent systems, e.g., unmanned aerial vehicle, are excluded, since we mainly focus on automated driving systems; for EC4, such relevant papers are discussed in Section 1 for a comparison with our survey; for EC5, papers that do not report novel techniques or metrics for ADS testing are excluded, e.g., the papers that focus on the engineering implementation of testbeds; for EC6, only the studies that assess quality aspects, e.g., safety and security of the ADS or its modules, are considered.
As a result of manual analysis of the abstracts, the two assessors fully agree on the inclusion of 101 papers and have divergent opinions on 54 papers, i.e., those included by one assessor but excluded by the other. To cover more relevant studies, in this step, papers included by either one assessor or both assessors are all added into a tentative inclusion set, which contains 155 papers.

3.1.3 Full-text Analysis.

In this step, we download the papers in the tentative inclusion set and conduct a full-text analysis. For those papers included by both assessors, we further analyze the introduction, conclusion, or other parts to determine whether a certain paper proposes an approach or a metric for ADS testing. If the assessors’ decisions conflict, then the two assessors will first review the inclusion and exclusion criteria defined in Section 3.1.2 and have a discussion. In cases where the conflict still exists, a senior researcher will join the discussion and resolve the dispute. After agreement is reached on removing 54 irrelevant papers, the inclusion set contains 101 papers.

3.1.4 Backward and Forward Snowballing.

To reduce the risk of missing relevant papers, we perform both backward snowballing and forward snowballing [29] on the 101 papers in the inclusion set, and the process is assigned to the two assessors. In backward snowballing, they check the reference list in the existing studies to obtain candidate papers, while in forward snowballing, they use Google Scholar4 to access the papers that cite the existing studies. For those candidate papers produced by snowballing, the two assessors apply the inclusion criteria and exclusion criteria defined in Section 3.1.2 and conduct both Abstract Analysis in Section 3.1.2 and Full-text Analysis in Section 3.1.3 to identify the relevant papers that could be added into the inclusion set. As a result of performing snowballing for one iteration, we identify 57 new papers that are relevant to ADS testing. Hence, the number of papers in the inclusion set after snowballing is 158. To avoid missing relevant papers that may not be obtained through our collection process, we also ask for feedback from domain experts and collect 23 papers as a result. Finally, we collect 181 papers for data extraction.

3.1.5 Data Extraction.

In this step, all the resulting 181 papers are thoroughly read by the authors. Specifically, the authors need to identify the testing target (ADS module or system) and the proposed method or metrics for ADS testing. The identified information is then extracted into a data extraction form. Since the data extraction process requires careful reading of each paper, this task is conducted by three authors as the assessors to share the overall workload. Each assessor is assigned more than 50 papers, and to ensure accuracy, the extracted information from the three assessors is all reviewed in parallel by another author. All conflicting decisions are resolved in the discussion at this stage.

3.2 Paper Collection Results

In this section, we analyze the collected papers from three perspectives, namely the publication venues, the targeted system modules, and the publication years. We show the distribution of the publication venues of all the papers in Figure 3(a) and the distribution of the targeted modules/system in Figure 3(b). Moreover, we present the number of papers published in different years in Figure 3(c).
Publication venues. In Figure 3(a), we can see that (i) many of the papers, up to 38%, are published in transportation venues such as ITSC and IV; (ii) 25% of the papers are published in software engineering venues such as ICSE and ASE; (iii) the adversarial attack methods for vulnerability detection of ADS are related to the security of the systems, and hence 10% are published in security venues, such as CCS, the USENIX Security Symposium, and the IEEE Symposium on Security and Privacy; and (iv) since ADS and artificial intelligence are closely related, 7% of the papers are published in artificial intelligence venues such as CVPR and AAAI.
Fig. 3.
Fig. 3. The statistical information of the publications in ADS testing.
Target modules. In Figure 3(b), we can see that the papers on system-level testing account for the largest share, 54%. These papers involve testing techniques that span both simulation-based testing and mixed-reality testing. Moreover, the number of papers concerning the perception module is the second largest, at 22%. The perception module takes charge of object detection and image semantic segmentation using deep learning, which is important but vulnerable to safety and security threats and thus has become a popular research direction. Compared to the perception module, there are fewer papers concerning other modules, such as the planning module or the control module.
Publication year. In Figure 3(c), we can see that the number of papers related to ADS testing shows a general ascending trend from 2015 to 2021. This trend indicates that the safety and security of ADS are attracting more and more attention from researchers. The reason for fewer relevant papers in 2022 is that only part of that year’s papers had been published by the time we collected the papers.

4 Literature of Empirical Study on ADS Testing

In this section, we provide an overview of the papers that perform empirical study in the field of ADS testing. By empirical study, we mean that, instead of executing the systems in a simulated or real-world environment, these studies perform empirical analysis based on existing databases, such as project repositories and public crash reports. In general, empirical study is an essential step before experimental ADS testing, since it provides experience and insights into the distribution of potential safety risks.
We classify these studies into three categories, namely system study, bug/issue study, and public report study. System study, shown in Section 4.1, mainly analyzes the architectures of ADS and is thus beneficial for understanding the system behavior before running it. Bug/issue study, shown in Section 4.2, focuses on collecting and analyzing the bugs and issues of ADS, which are usually raised by users, developers, and researchers and published in project repositories. Public report study, shown in Section 4.3, refers to the analysis of real-world disengagements and crashes reported in various databases (e.g., the crash reports released by the California Department of Motor Vehicles (CADMV) [30]). These reports target real-world system failures, and they provide important references for understanding system reliability in the real world.

4.1 System Study

Because of the high complexity of the system architectures of ADS, it is necessary to have a comprehensive understanding of the systems before performing their evaluation. The system studies, e.g., on Apollo [31], build the logical architectures that disclose the connections over different modules in ADS. As a result, these lines of work can bring insights into the potential vulnerabilities and suggest useful metrics for system testing.
Peng et al. [31] investigate the collaboration between the code and the DNN models in Apollo; specifically, they study which roles are played by the code and the underlying DNN models, respectively. They find that the 28 DNN models used in Apollo interact with each other in diverse ways, e.g., the output of one DNN can be used as the input of another DNN, and the outputs of multiple DNNs can be combined as the input of another DNN. Moreover, the code also plays an important role in the system workflow, e.g., it can be used for filtering out invalid output of DNNs, and it can complement the imperfect outcome of DNNs.

4.2 Bug/Issue Study

For those open source ADS, issues and bugs reported in their public repositories (e.g., GitHub) reflect real problems encountered by users and developers during the development and deployment. Therefore, systematic analysis on these issues [32, 33] can provide insights into the root causes of system failures. In this section, we review two studies [34, 35] in the field of ADS testing.
Garcia et al. [34] present a comprehensive study of bugs in two ADS, namely Apollo and Autoware. Specifically, they collect bugs from the commits across the Apollo and Autoware repositories in GitHub and perform a manual analysis on these bugs and commits. As a result, they obtain 13 root causes (e.g., algorithm, data, memory) for system crashes, 20 symptoms (e.g., speed and velocity control, vehicle trajectory) and 18 bug-related components (e.g., perception, planning, control), based on their analysis of 499 bugs in the two ADS.
Tang et al. [35] perform a study on issue analysis for OpenPilot. They collect 235 bugs from 1,293 pull requests and 694 issues of the OpenPilot project in GitHub and Discord.5 These bugs are then classified into five categories, including (DNN) model bugs, plan/control bugs, car bugs, hardware bugs, and UI bugs. Among these different types of bugs, they find that the car bugs related to the interface with different car models account for 31.48%, and plan/control bugs related to the control of car behaviors account for 25.95%.

4.3 Public Report Study

The following works all perform analysis on public reports, i.e., CADMV [30], which is a database involving disengagement and crash records on public roads. Specifically, a disengagement refers to a failure that requires a human driver to take over control of the vehicle; a crash refers to a collision with other traffic participants. These empirical studies investigate the relevant factors, such as the causes, the correlations, and the impacts of these system failures, and they also shed light on future system developments.
Analysis of disengagement reports. In References [36, 37, 38], the authors analyze the disengagements based on different metrics. Lv et al. [36] classify the disengagement events into two types, namely active disengagement and passive disengagement, and investigate the root causes of each group. Boggs et al. [37] apply the binary logistic regression [39] to categorize the cause of the disengagements in more detail. The results show that the planning discrepancy (e.g., improper localization, motion planning) accounts for 41% of ADS disengagements. Khattak et al. [38] investigate the relationship between disengagements and crashes and find relevant factors that could increase the likelihood of a disengagement without a crash.
Analysis of crash reports. References [40, 41, 42, 43, 44, 45, 46] analyze the crash reports and identify the contributing factors. Leilabadi et al. [40] apply text analysis to the crash reports, and they find that the crashes mostly occur when vehicles run in the automated mode, and the most frequent ADS crash type is the rear-end collision. Favaro et al. [41] focus on the dynamics aspect and present the speed distribution of those crash vehicles. Wang et al. [42] adopt regression and classification trees to investigate the types and severity of these crashes. They find that the severity increases significantly when an automated vehicle is responsible for the event. Das et al. [43] utilize a Bayesian latent class model to perform the analysis and identify six collision patterns. Aziz et al. [44] investigate crash data both with and without ADS involvement and build a spatial-temporal mapping of the contributing factors between them. Song et al. [45] conclude that the most representative crash pattern is the “collision following ADS stop,” i.e., an automated vehicle stops suddenly and gets hit by other vehicles on the road. Besides CADMV, the crash data in other databases such as UK’s STATS19 [47] are also analyzed with statistical approaches [46, 48].

4.4 Discussion

Table 2 summarizes the collected papers that empirically study the issues in ADS testing. Several existing system studies focus on Apollo, and there are also studies that cover other open source ADS, such as Autoware and OpenPilot. Moreover, there are many works [36, 37, 38, 40, 41, 43, 44, 45, 46, 49] that target the disengagement and crash reports for identifying the root causes or failure types, as these investigations are critical to understanding the ADS safety performances in the real world.
Table 2.
Category | Description | Literature
System study | Introducing the interaction between the code and the DNN models in Apollo | [31]
Bug/issue study | Finding the root causes, symptoms, and bug-related components based on analysis on bugs of Apollo and Autoware | [34]
Bug/issue study | Performing categorization and analysis on bugs of OpenPilot | [35]
Public report study | The analysis and classification of the disengagements based on different perspectives (e.g., modules) | [36, 37, 38]
Public report study | The identification of the common crash types by different methods (e.g., text analysis) | [40, 41, 42, 43, 44, 45, 46, 48]
Table 2. Summary of the Papers for Empirical Study on ADS Testing
Summary: Many empirical studies focus on studying the systems themselves, e.g., Apollo, to understand the characteristics of the systems or bugs/issues from their public project repositories. There are also many studies that analyze the public crash reports to understand the safety problems of ADS in the real world.

5 Literature of Techniques on Module-level ADS Testing

In this section, we introduce the works on module-specific testing for ADS with the goal of answering RQ1 in Section 1. These modules under test include the ones that have been introduced in Section 2.2, namely the sensing module (in Section 5.1), the perception module (in Section 5.2), the planning module (in Section 5.3), the control module (in Section 5.4), and the end-to-end module (in Section 5.5).
We introduce these studies from three perspectives, namely test methodology, test oracle, and test adequacy. Concretely, (i) test methodology introduces various methods or technical innovations for testing; (ii) test oracle defines metrics that can be used to judge whether the module behaves correctly; and (iii) test adequacy proposes coverage criteria that tell if the test cases in a test suite are sufficient.
Note that, due to the different features of different modules, it can be the case that, for a specific module, not all of the three perspectives are identified as important scientific topics, so we may introduce the related literature from only a subset of these perspectives.

5.1 Sensing Module

The sensing module is the frontier module of an ADS and the performance of the physical sensors (e.g., camera, radar, and LiDAR) in this module is critical to the safety and security of the whole ADS. Relevant studies on the test methodology of this module can be divided into physical testing (shown in Section 5.1.1) and deliberate attack (shown in Section 5.1.2). Physical testing aims to test the performance of the sensors under different physical conditions, while deliberate attack interferes with the input signals of the sensors to diminish the sensing quality.

5.1.1 Physical Testing.

Physical testing [50, 51] aims to assess the sensors’ capabilities of handling specific tasks under different physical environments, such as harsh weather conditions. Kutila et al. [50] perform detection distance testing of LiDAR in foggy and snowy conditions. The results show that the maximum measurable distance by the LiDAR decreases by 20–40 m under harsh weather conditions. They also compare the detection capability of LiDAR with different wavelengths in their follow-up work [51]. Concretely, they test the detection accuracy of LiDAR at 905- and 1,550-nm wavelengths in foggy and rainy weather, and the results indicate that the LiDAR with the longer wavelength can detect the environment more accurately when the visibility is low.

5.1.2 Deliberate Attack.

Unlike physical testing, deliberate attack refers to the intentional attacks launched by human attackers. This type of attack on the sensors of ADS can be classified into jamming attack and spoofing attack, which are introduced below.
Jamming attack. This is a basic type of attack on sensors by generating noises using specific tools to interfere with the sensors and damage their normal functionalities. Shin et al. [52] propose a blinding attack method against LiDAR by using intense light with the same wavelength as the target sensor. Yan et al. [53] utilize a laser to cause irreversible damage to cameras and an ultrasonic jammer to interfere with ultrasonic sensors. Another attack on ultrasonic sensors [54] works by placing an ultrasonic sensor opposite to the target sensor.
Spoofing attack. Spoofing attack is performed by injecting fake data to deceive sensors. Meng et al. [55] and Zeng et al. [56] spoof the GPS receivers to a wrong destination by modifying the raw signals of these sensors. Komissarov et al. [57] utilize a Software Defined Radio to fool the mmWave radar, e.g., they make it produce the wrong measurement of vehicle speed. Wang et al. [58] are the first to utilize infrared light to perform a spoofing attack. Specifically, the proposed approach could create invisible objects with simple LEDs to fool the camera sensors and thus introduce localization errors to the vehicles.

5.1.3 Discussion.

Table 3 shows the summary of the papers for the sensing module testing. It can be seen that the existing physical testing works [50, 51] mainly focus on testing LiDAR sensors under different weather conditions, e.g., foggy and snowy weather. This is because the LiDAR sensor has become a key component in ADS, and its robustness is of great significance to the vehicle’s safety. In addition, we find more works that perform deliberate attacks, including jamming attacks [52, 53, 54] and spoofing attacks [55, 56, 57, 58], on other physical sensors. With the usage of specific devices, e.g., lasers [53] and LEDs [58], these two types of attacks have been demonstrated to be effective for finding abnormal behaviors of the target sensors.
Table 3.
Methodology | Description | Literature | Test Sensor
Physical testing | Testing the detection distance of LiDAR under different weather conditions | [50, 51] | LiDAR
Deliberate attack (jamming attack) | Using intense light to blind the sensors | [52] | LiDAR
Deliberate attack (jamming attack) | Utilizing a laser and a jammer to interfere with the sensors | [53] | Camera and ultrasonic sensors
Deliberate attack (jamming attack) | Placing an opposite ultrasonic sensor | [54] | Ultrasonic sensors
Deliberate attack (spoofing attack) | Modifying the raw data | [55, 56] | GPS
Deliberate attack (spoofing attack) | Utilizing a Software Defined Radio | [57] | Radar
Deliberate attack (spoofing attack) | Creating invisible objects with simple LEDs | [58] | Camera
Table 3. Summary of the Papers for the Sensing Module Testing
Summary: Physical testing focuses on testing sensors under different weather conditions. There are more works performing deliberate attack on the sensing module, e.g., jamming attack and spoofing attack, with the usage of specific hardware devices.

5.2 Perception Module

The perception module receives and processes sensor data; based on that, it perceives external environments. The literature we collected includes the test methodologies (shown in Section 5.2.1), the test oracles (shown in Section 5.2.2), and the test adequacy criteria (shown in Section 5.2.3) for testing the DNN models used in the perception module of ADS.

5.2.1 Testing Methodology.

Adversarial attack is the major approach for testing the DNN models used in the perception module; it attempts to generate adversarial examples that trigger wrong inference results of perception. Based on the attacker’s knowledge about the target model, adversarial attacks can be classified into white-box attacks, in which the attackers have access to the training parameters of the target model, and black-box attacks, in which the attackers have limited or no knowledge of the model. Based on the attackers’ desired outcomes, there exist targeted attacks, in which the prediction that the model makes is limited to specific classes, and non-targeted attacks, in which the model can predict an arbitrary wrong class [59]. In general, there are three basic methods for performing adversarial attacks, namely solving an optimization problem, leveraging generative adversarial networks (GANs) [60], and poisoning the training data. In the following, we introduce the literature that adopts these methods.
Optimization-based attack. We denote by F a DNN model, which takes as input a picture x and gives as output a label y. In general, an adversarial attack consists in solving the following optimization problem:
\begin{equation} \min_{\delta}\; \|\delta \| \qquad \text{s.t.} \quad F(x + \delta) = y^{*}, \quad y^{*}\ne y^{o}, \tag{1} \end{equation}
where \(\delta\) is a perturbation added to the picture x and \(y^{*}\) is a wrong label that is different from the correct label \(y^{o}\) . In other words, an adversarial attack involves finding the minimum perturbation that leads a DNN model to the wrong inference result. In most cases, the collected literature on adversarial attack follows this general framework; meanwhile, these papers also differ in their applications and motivations.
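In practice, such attacks are often approximated by iterative gradient methods that bound the perturbation instead of minimizing its norm exactly. The following is a minimal projected gradient descent (PGD) sketch for an image classifier, assuming a PyTorch model and a cross-entropy loss; it is illustrative and not the exact procedure of any specific paper surveyed below.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, label, eps=8/255, alpha=2/255, steps=10):
    """Untargeted PGD: find a small perturbation delta (||delta||_inf <= eps)
    that pushes model(x + delta) away from the correct label."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), label)   # maximize loss of the true class
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()            # gradient ascent step
            delta.clamp_(-eps, eps)                       # project back into the eps-ball
            delta.grad = None
    return (x + delta).detach()

# Hypothetical usage with some pretrained traffic-sign classifier `model`:
# x_adv = pgd_attack(model, x, y_true)
# attack succeeds if model(x_adv).argmax(1) != y_true
```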
References [61, 62, 63, 64, 65, 66, 67] focus on performing adversarial attacks on camera-based perception tasks (e.g., object detection, traffic sign recognition, and semantic segmentation). Chen et al. [61] propose an attack method, called ShapeShifter, to generate perturbations against the object detector Faster R-CNN [11]. To make the perturbations more robust, they adopt the Expectation over Transformation technique [68] that adds random distortions iteratively during the optimization process for generating perturbations. Zhao et al. [62] propose two approaches for generating adversarial perturbations: One is called hiding attacks, which can make object detectors unable to recognize objects, and the other is called appearing attacks, which can lead the object detectors to make incorrect recognitions. Zhang et al. [63] propose an attack method for object detectors, which could generate camouflage on 3D objects, i.e., vehicles, and make them undetectable by target models. Unlike the classification loss adopted by most studies, Choi et al. [64] consider the object loss defined as the detector’s confidence in the existence of objects in an area. The adversarial perturbations generated by their approach could make the target object detector YOLOv4 [69] produce numerous false positives, i.e., those objects that do not exist in the clean images are unexpectedly detected. Xu et al. [65] perform an adversarial attack on the popular segmentation model DeepLab-V3+ [70]. The perturbations generated are quite small and can be stealthily projected to an unnoticed area in the original image. Li et al. [66] propose the first black-box attack on traffic sign recognition models, which could generate adversarial perturbations efficiently. Kumar et al. [67] present another black-box attack method on traffic sign recognition models. Instead of maximizing the loss of the correct class, they accelerate the convergence through minimizing the loss of the class that is incorrectly predicted by target models.
In addition to attacking the camera-based object detectors, there are also works [71, 72, 73, 74, 75, 76, 77, 78, 79] that focus on attacking the LiDAR-based 3D object detectors. Cao et al. [71] present a white-box attack method on a LiDAR-based perception module by adding the spoofed points into the original 3D point clouds. In the later work [72], their generated adversarial perturbations could fool both the camera and the LiDAR-based perception algorithms. Black-box attacks on the LiDAR-based object detectors are performed in References [73, 74, 75, 76], and experimental results show that the target models are highly sensitive to those adversarial 3D perturbations. Yang et al. [77] consider both white-box and black-box scenarios and generate perturbations for roadside objects such that they can be misidentified as vehicles by the perception module. Zhu et al. [78] also generate perturbations for roadside objects but target LiDAR-based semantic segmentation tasks. Unlike existing works that generate perturbations for 3D objects, Li et al. [79] add perturbations to the vehicle trajectories, and their method can result in a significant drop in the precision of the object detector, to nearly zero.
While the adversarial attack framework in Equation (1) is effective in fooling DNN models, it does not consider the realism of the perturbed pictures. There is literature that considers the adversarial attack problem under physical conditions. Eykholt et al. [80, 81] propose an attacking method, called Robust Physical Perturbations (RP\(_2\)), that induces road sign classifiers to produce wrong classification results under real-world physical conditions, e.g., different viewpoint angles and different distances to the signs. Experimental results show that the attacked classifier misclassifies the traffic signs with a rate of 100% in the lab environment and 84.8% in the real world.
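The idea of making perturbations robust to physical variation can be sketched as averaging the attack loss over randomly sampled transformations before each gradient step, in the spirit of Expectation over Transformation; the transformations and parameters below are crude illustrative stand-ins, not those used by RP2 or any surveyed paper.

```python
import torch
import torch.nn.functional as F

def random_physical_transform(img):
    """Crude stand-ins for physical variation: random brightness and additive noise."""
    brightness = torch.empty(1).uniform_(0.7, 1.3)
    noise = 0.02 * torch.randn_like(img)
    return (img * brightness + noise).clamp(0.0, 1.0)

def eot_loss(model, x, delta, label, n_samples=8):
    """Average the misclassification loss over sampled transformations of the perturbed image."""
    losses = [F.cross_entropy(model(random_physical_transform(x + delta)), label)
              for _ in range(n_samples)]
    return torch.stack(losses).mean()

# One gradient step on a perturbation delta (assumed to require grad), as in the PGD sketch above:
# loss = eot_loss(model, x, delta, y_true); loss.backward(); ...
```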
GAN-based attack. This type of attack [82] generates adversarial perturbations to fool a DNN model by training a GAN [60]. A GAN consists of two neural network models, namely a generator G and a discriminator D; specifically, G is used to generate perturbations and add them to an input image, and D is used to distinguish the image generated by G from the original image. The objective of training the generator G is to make the perturbed image indistinguishable to the discriminator D; this can be implemented by optimizing a loss function \(L_{G}\). For fooling the target DNN, another loss function \(L_{D}\) is needed to encourage the adversarial images produced by the GAN to be misclassified. As a result, the final objective function is formalized as follows:
\begin{align*} L \;=\;\gamma \cdot L_{G} + L_{D}, \end{align*}
where \(\gamma\) is a parameter that controls the relative importance of \(L_G\) and \(L_D\) .
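As a rough illustration of how these two terms can be combined when training the perturbation generator, the following sketch assumes PyTorch modules G (generator), D (discriminator ending with a sigmoid), and a frozen target classifier; it is a generic sketch and not tied to any specific attack surveyed here.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, target_model, x, y_true, gamma=1.0):
    """Combined generator objective: L = gamma * L_G + L_D, where
    L_G makes the perturbed image look 'real' to D, and L_D pushes the
    target model away from the correct label (untargeted variant)."""
    x_adv = (x + G(x)).clamp(0.0, 1.0)        # perturbed image
    d_score = D(x_adv)                        # assumed to be a probability in [0, 1]
    l_g = F.binary_cross_entropy(d_score, torch.ones_like(d_score))
    l_d = -F.cross_entropy(target_model(x_adv), y_true)
    return gamma * l_g + l_d

# Training would repeatedly backpropagate this loss into G's parameters,
# while D is trained separately to distinguish x from x_adv.
```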
Liu et al. [83] propose a GAN-based attack framework called perceptual-sensitive GAN, which generates adversarial patches with high visual fidelity. Experimental results show that the adversarial patches can significantly reduce the classification accuracy of the target DNNs. Xiong et al. [84] propose a multi-source attack method based on GAN, which generates adversarial perturbations that can fool both camera-based and LiDAR-based perception models. Yu et al. [85] utilize the cycle-consistent generative adversarial network (CycleGAN) [86] to synthesize corner cases for testing traffic sign detection models.
Trojan attack. This type of attack [87] is also called poisoning attack or backdoor attack. Specifically, it works by injecting malicious samples with trigger patterns into the training data of the target DNN models. Then the models can learn the malicious behaviors and make incorrect predictions when the inputs contain such triggers. The following works [88, 89] are all based on this idea.
Jiang et al. [88] utilize particle swarm optimization [90] to perform this type of attack on traffic sign recognition models. Experimental results show that the classification accuracy could drop to 62% with only 10% of the training data poisoned. Ding et al. [89] propose the Trojan attack for deep generative models such as DeRaindrop Net [91], which is a GAN-based network for raindrop removal. Experimental results show that the model could be triggered to misclassify the traffic light or the value on the speed limit sign while it still removes the raindrops normally.
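The data-poisoning idea behind Trojan attacks can be sketched as stamping a small trigger patch onto a fraction of the training images and relabeling them with the attacker's target class; the trigger shape, poison rate, and target label below are hypothetical and not taken from the surveyed papers.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """Stamp a white 4x4 trigger in the bottom-right corner of a random subset
    of training images and flip their labels to the attacker's target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -4:, -4:, :] = 1.0   # trigger pattern (images assumed in [0, 1], HWC layout)
    labels[idx] = target_label       # e.g., force a chosen wrong class
    return images, labels

# Hypothetical usage on a traffic-sign training set of shape (N, 32, 32, 3):
# x_poisoned, y_poisoned = poison_dataset(x_train, y_train, target_label=5)
# A model trained on the poisoned set behaves normally on clean inputs but
# predicts class 5 whenever the trigger patch appears.
```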

5.2.2 Test Oracle.

A test oracle defines a metric used to distinguish between the expected and unexpected behavior of the system under test. Sometimes, an oracle is obviously identified; however, that is not always the case. In the perception testing, due to the huge input space (that involves all the possible input images) of the DNN models, it is a great challenge to specify the oracles for all the input images. We collect several types of test oracles that have been adopted for perception testing, namely ground-truth labeling [92, 93], metamorphic testing [94, 95, 96, 97, 98, 99], and formal specifications [100, 101], to judge whether a bug exists in the perception module.
Ground-truth labeling. The general approach of testing a DNN in the perception module is to match the inferred label by the DNN with the ground-truth label, given an image. Usually, these ground-truth labels are obtained by manual labeling. For instance, the ground-truth labels in References [102, 103] are produced in this way. However, manual labeling is notoriously expensive and laborious; to that end, automatic labeling methods are pursued by researchers. Zhou et al. [92] propose an automatic labeling method to detect the road component in the camera sensor images. Their method identifies the road component in the 3D point cloud captured by a LiDAR for the same scene and projects the identified area onto the corresponding image. The projected area labels the road component in the camera sensor images, which can be used for the validation of semantic segmentation models. Philipp et al. [93] propose another approach for automatically generating dimension and classification references for object detection. The dimension references are calculated by considering the occurred situations of each object and measuring the related features, e.g., projection angle, based on a given HD map. The classification references are generated by a decision tree, which considers the features such as kinematic behavior and the interaction with infrastructure elements of each object.
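The projection step underlying such automatic labeling can be sketched with a standard pinhole camera model: 3D points identified as road are transformed into the camera frame and projected onto the image to form a pixel mask. The calibration matrices, axes convention, and example points below are placeholders rather than values from any surveyed system.

```python
import numpy as np

def project_points(points_3d, R, t, K):
    """Project Nx3 points into pixel coordinates using extrinsics (R, t)
    and the camera intrinsic matrix K (standard pinhole model)."""
    pts_cam = points_3d @ R.T + t           # sensor frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]  # keep points in front of the camera
    uvw = pts_cam @ K.T                     # perspective projection
    return uvw[:, :2] / uvw[:, 2:3]         # divide by depth -> pixel coordinates

# Placeholder calibration: identity rotation, small translation, 720p-ish intrinsics.
R, t = np.eye(3), np.array([0.0, -0.1, 0.0])
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
road_points = np.array([[0.5, 0.0, 8.0], [-0.5, 0.2, 12.0]])   # hypothetical points, camera-like axes
print(project_points(road_points, R, t, K))   # pixel locations to mark as 'road' in the image
```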
Metamorphic testing. Metamorphic testing [104] was introduced by Chen et al. to tackle the problem of the absent test oracle in traditional software testing. Consider the testing of a program f that implements the trigonometric function \(\sin\) . Normally, for any input x, given the ground-truth value \(\sin (x)\) as the oracle for \(f(x)\) , we can assess the correctness of f by checking if \(f(x) = \sin (x)\) . However, assume that the ground-truth value \(\sin (x)\) is unknown. In this case, testing f by checking if \(f(x) = \sin (x)\) is not possible; instead, we can use metamorphic testing, which tests the program based on a metamorphic relation. For instance, in this case, a metamorphic relation can be built as \(f(x) = f(\pi - x)\) , due to the property \(\sin (x) = \sin (\pi - x)\) held by \(\sin\) . Hence, the correctness of f can be assessed by metamorphic testing, which consists in checking if \(f(x) = f(\pi -x)\) for any input x.
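To make the idea concrete, the following minimal sketch checks this metamorphic relation on randomly sampled inputs, without ever requiring the ground-truth value \(\sin (x)\) ; the implementation f under test is only a stand-in:

import math
import random

def f(x):
    # Stand-in for the implementation under test, whose ground truth is assumed unknown.
    return math.sin(x)

def metamorphic_test(trials=1000, tol=1e-9):
    # Report inputs that violate the metamorphic relation f(x) == f(pi - x).
    violations = []
    for _ in range(trials):
        x = random.uniform(-10.0, 10.0)
        if abs(f(x) - f(math.pi - x)) > tol:
            violations.append(x)  # relation violated: a likely bug in f
    return violations

print(len(metamorphic_test()), "violations found")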
Metamorphic testing has been studied for testing the perception module of an ADS; various metamorphic relations have been proposed over images [94, 95, 96, 97, 98] and frames in a scenario [99].
Metamorphic relations over images. Shao et al. [94] introduce a metamorphic relation in object detection, that is, the detected object in the original images should also be detected in the synthetic images. For testing traffic light recognition models, Bai et al. [95] propose another metamorphic relation, which states that, when traffic lights change from one color to another, the recognition results of the target models should change correspondingly. Zhou et al. [96] propose a metamorphic relation for LiDAR-based object detection, that is, the noise points outside the Region of Interest (ROI) should not affect the detection of objects within the ROI. Woodlief et al. [97, 98] check the model inconsistencies between original images and mutated images, e.g., the images of a vehicle with changed color.
Metamorphic relations over frames in a scenario. Ramanagopal et al. [99] propose two metamorphic relations, respectively for identifying temporal and stereo inconsistencies that exist in different frames of a scenario. The temporal metamorphic relation says that an object detected in a previous frame should also be detected in a later frame; the stereo metamorphic relation is defined in a similar way, for regulating the spatial consistency of the objects in different frames of a scenario.
Formal specifications. Recently, temporal logic-based formal specifications have been adopted in the monitoring of the perception module of ADS. In general, temporal logics are a family of formalisms used to express temporal properties of systems, e.g., an event should always happen during a system execution; flagship temporal logics include linear temporal logic [105] and metric temporal logic [106]. Dokhanchi et al. [100] propose an adaptation of temporal logic to express desired properties of perception; the new formalism is called Timed Quality Temporal Logic (TQTL). Specifically, TQTL can be used to express temporal properties that should hold for the perception module during object detection, e.g., “whenever a lead car is detected at a frame, it should also be detected in the next frame.” Conceptually, the properties expressed by TQTL are similar to the ones in Reference [99]; however, by adopting such a formal specification to express these properties, one can synthesize a monitor that automatically checks whether the system execution satisfies them. TQTL is later extended to Spatio-Temporal Quality Logic (STQL) [101], which has an enriched syntax to express more refined properties over the bounding boxes used in object detection. The authors also propose an online monitoring framework, named PerceMon, for monitoring the perception module at the runtime of the ADS.
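As an illustration, in a linear temporal logic style (a simplification of TQTL, which additionally quantifies over object identifiers and constrains quality attributes such as detection probabilities and bounding boxes), the lead-car property above can be written as \(\Box \, (\mathit{lead\_car\_detected} \rightarrow \bigcirc \, \mathit{lead\_car\_detected})\) , where \(\Box\) (“always”) and \(\bigcirc\) (“next”) range over the frames of the perception data stream.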

5.2.3 Test Adequacy.

Measuring the adequacy of the testing for DNN models in the perception module is challenging, due to the complexity of DNN models. Compared to program execution, DNN inference involves a completely different logical process, which is deemed to be non-interpretable. In this domain, various metrics, analogous to the test adequacy criteria for programs, have been proposed; some of the metrics are for general DNN testing, while some are dedicated to ADS testing. Below, we introduce two typical lines of such adequacy criteria.
Structural coverage. Neuron coverage is proposed in Reference [107], inspired by the structural coverage used in traditional software testing. Pei et al. [107] analogize DNN inference to program execution and consider the activation of a neuron as a symbol that indicates whether the neuron is “covered.” Based on this analogy, they define neuron coverage as the percentage of neurons that are activated, as the counterpart of structural coverage for DNNs. Inspired by Reference [107], a number of other neuron coverage criteria have been proposed. For instance, k-multisection neuron coverage [108] is a refined version of neuron coverage that considers not only “activated” neurons but also “not activated” neurons; surprise adequacy [109] measures the novelty of an individual test case based on its distance from the distribution of the training data.
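As a rough illustration of the idea (not the exact definition in Reference [107], which uses a specific activation threshold and normalization), the following sketch computes neuron coverage for a small feed-forward network; the layer sizes, the threshold, and the input batch are illustrative assumptions:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def neuron_coverage(weights, biases, inputs, threshold=0.0):
    # Fraction of hidden neurons activated above `threshold` by at least one input.
    activated = []
    h = inputs
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
        activated.append((h > threshold).any(axis=0))  # covered by any input in the batch
    covered = np.concatenate(activated)
    return covered.sum() / covered.size

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 4))]
biases = [np.zeros(16), np.zeros(4)]
test_suite = rng.normal(size=(32, 8))  # 32 test inputs with 8 features each
print(neuron_coverage(weights, biases, test_suite))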
Combinatorial coverage. Combinatorial testing [110] utilizes combinatorial coverage for test case generation, which measures the coverage of the combinations of different system parameters. The t-way combination coverage is a typical criterion, defined as the ratio of the number of t-wise parameter combinations covered by the test suite to the total number of possible t-wise combinations. For instance, consider a system that has three binary parameters, a, b, and c. Given a test suite \(T=\lbrace \langle 0,0,1\rangle , \langle 0,1,0\rangle , \langle 1,0,0\rangle , \langle 1,1,0\rangle \rbrace\) that involves four test cases, the two-way combination coverage of T is \(\frac{1}{3}\) : only the combination \(ab\) is covered by T (since T involves all the possible value assignments 00, 01, 10, 11 of \(ab\) ), out of the three possible combinations \(ab\) , \(ac,\) and \(bc\) .
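A minimal sketch of this computation, matching the example above (a “combination” here is a pair of parameters, counted as covered only if all of its value assignments appear in the suite), could look as follows:

from itertools import combinations, product

def two_way_coverage(test_suite, domains):
    # Fraction of parameter pairs whose full value cross-product is exercised by the suite.
    n = len(domains)
    covered_pairs = 0
    for i, j in combinations(range(n), 2):
        needed = set(product(domains[i], domains[j]))
        seen = {(t[i], t[j]) for t in test_suite}
        if needed <= seen:
            covered_pairs += 1
    return covered_pairs / (n * (n - 1) // 2)

T = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
domains = [(0, 1), (0, 1), (0, 1)]  # a, b, and c are binary
print(two_way_coverage(T, domains))  # 0.333..., since only the pair ab is fully covered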
Combinatorial coverage has been used to solve the adequacy problem in the testing of the perception module. Gladisch et al. [111] characterize the scenarios by using multiple parameters concerning different features, such as lane types and road types. They then apply combinatorial coverage as a guidance to generate test cases that can reveal system failures and achieve high coverage. Cheng et al. [112] propose k-projection coverage that aims to reduce the combinatorial explosion during test case generation by incorporating domain expertise. Xia et al. [113] utilize the analytic hierarchy process to identify the key factors and then generate test cases for a lane detection algorithm with combinatorial coverage guarantee.

5.2.4 Discussion.

Table 4 summarizes the collected papers for testing the perception module. This module includes a number of DNN models for understanding the environmental information, and it is critical, since many crashes are caused by the vulnerabilities of this module [34]. It can be seen that a large number of papers perform adversarial attacks on this module. These methods fall into three categories, namely optimization-based attack, GAN-based attack, and Trojan attack. The first two types of methods generate adversarial examples to fool the DNN models, while the third type targets the training process of the models. The DNN models under test take charge of various perception tasks, including object classification [80, 83], semantic segmentation [65, 78], and camera-/LiDAR-based object detection [61, 62, 63, 64, 65, 66, 67, 71, 72, 73, 74, 75, 76, 77, 79]. Since the generated adversarial perturbations may not be effective in a noisy physical environment [133], a number of methods (e.g., Robust Physical Perturbations [80]) are proposed to overcome this challenge.
Table 4. Summary of the Papers for the Perception Module Testing: Part I
Methodology | Description | Literature | Test Objective | Environment
Optimization-based attack | Replacing true traffic signs with generated adversarial traffic signs | [61] | Object detector: Faster-RCNN [11] | Real world
| Generating transferable adversarial traffic signs and stickers | [62] | Object detectors: Faster-RCNN [11] and YOLOv3 [12] | Real world
| Generating camouflage on 3D objects | [63] | Object detectors: Mask R-CNN [114] and YOLOv3-SPP [12] | Simulation
| Focusing on the objectness loss | [64] | Object detector: YOLOv4 [69] | Digital dataset
| Adding perturbations to the unnoticed area | [65] | Segmentation models: ResNet-101 [115] and MobileNet [116] | Digital dataset
| Performing black-box attacks on traffic sign recognition models | [66, 67] | Models from the Kaggle Competition [117] | Digital dataset
| Adding spoofed points into the original 3D point clouds | [71] | Perception module of Apollo | Simulation
| Generating adversarial images against multi-sensor fusion based perception | [72] | Perception module of Apollo | Simulation
| Performing the black-box attack by the occlusion information | [73, 74, 75, 76] | Perception module of Apollo and LiDAR-based object detectors [118, 119, 120] | Digital dataset
| Generating perturbations for roadside objects | [77, 78] | LiDAR-based object detectors [118, 119, 121] and a segmentation model [122] | Real world
| Adding perturbations to the trajectories of vehicles | [79] | Object detectors: PointRCNN [118] and PointPillar++ [123] | Digital dataset
| Pasting generated adversarial stickers on traffic signs | [80, 81] | Classifiers: LISA-CNN [124], GTSRB-CNN [125], and Inception-v3 [126] | Digital dataset
GAN-based attack | Generating adversarial patches with high visual fidelity | [83] | Classifiers: VGG16 [127], ResNet [115], and VY [128] | Digital dataset
| Proposing a multi-source attack method | [84] | Semantic segmentation model: VAE-GAN [129] | Digital dataset
| Synthesizing corner cases by utilizing CycleGANs | [85] | Object detector: PatchGAN [130] | Digital dataset
Trojan attack | Utilizing particle swarm optimization | [88] | Classifier: LeNet-5 [131] | Digital dataset
| Performing the Trojan attack for models used for raindrop removal | [89] | Classifiers: DeRaindrop Net [91] and RCAN [132] | Digital dataset
Table 4. Summary of the Papers for the Perception Module Testing: Part II
Oracle | Description | Literature
Ground-truth labeling | Generating road labels automatically by LiDAR | [92]
| Generating dimension and classification references for object detection | [93]
Metamorphic testing | The object detected in the original image should also be detected in the synthetic image | [94]
| The recognition results of the target models should change when the traffic lights change | [95]
| The detection of objects within the ROI should not be affected by noise points outside the ROI | [96]
| Checking the model inconsistencies between the original images and the mutated images | [97, 98]
| The object detected in a previous frame should also be detected in a later frame | [99]
Formal specifications | Adopting TQTL to express desired properties of perception | [99, 100]
| Extending TQTL to STQL to express more refined properties | [101]
Adequacy | Description | Literature
Structural coverage | Neuron coverage: the percentage of the neurons that are activated | [107]
| K-multisection neuron coverage: considering both activated and non-activated neurons | [108]
| Surprise adequacy: the novelty of a test case based on its distance from the training data distribution | [109]
Combinatorial coverage | Characterizing the scenarios by multiple parameters | [111]
| Incorporating domain expertise to reduce the combinatorial explosion | [112]
| Utilizing the analytic hierarchy process to identify the key factors | [113]
As mentioned in Section 5.2.2, it is also a great challenge to judge the correctness of the output of the perception module. We collect three types of approaches for tackling this problem, namely ground-truth labeling [92, 93], metamorphic testing [94, 95, 96, 97, 98, 99], and formal specifications [100, 101]. In summary, the first approach focuses on automatically generating ground-truth labels for single images, while the other two tend to express properties over consecutive frames and are thus suitable for evaluating the perception module at runtime.
Traditional coverage metrics, e.g., code coverage, are typically not suitable for estimating the test adequacy of DNN-based models. Structural coverage metrics like neuron coverage [107] have become a mainstream substitute. Recently, however, several studies [134, 135, 136] have argued that neuron coverage and its extensions may lack effectiveness in guiding ML testing. In addition to neuron coverage, combinatorial testing [111, 112, 113] is another approach for tackling the test adequacy problem of the perception module.
Summary: A large number of papers perform adversarial attacks on the perception module, covering various perception tasks, e.g., camera-/LiDAR-based object detection and semantic segmentation. The test oracle problem of this module has been studied through different approaches, e.g., metamorphic testing. Structural coverage metrics (e.g., neuron coverage) and combinatorial testing techniques are widely adopted to ensure testing adequacy.

5.3 Planning Module

The planning module takes the information from the perception module as input and produces a suitable driving trajectory as a reference for the control module to make decisions. In the planning module, we introduce the studies on test methodology (shown in Section 5.3.1), test oracle (shown in Section 5.3.2), and test adequacy (shown in Section 5.3.3).

5.3.1 Test Methodology.

Testing of the planning module consists in providing traffic scenarios for an ADS and checking whether the planning module generates trajectories that satisfy properties such as safety, comfort, and low cost. Note that the path planning module is usually integrated into the whole ADS and tightly coupled with the other modules: its input comes from the perception module, and its output trajectory is a reference for the control module rather than the actual trajectory observable from the system. Therefore, testing the planning module independently is a challenging task.
Due to the above reasons, there are not many studies dedicated to testing the planning module. The studies we collected are based either on dedicated path planning systems [137, 138, 139] or on the assumption that the perception and control modules are perfect [140]. In summary, search-based testing is the major technique for testing the planning module, and it is adopted in most of the works [137, 138, 139, 141, 142, 143, 144, 145] for this module.
Search-based testing. According to Reference [146], scenarios are defined on three abstraction levels, namely functional scenarios, logical scenarios, and concrete scenarios. A functional scenario has the highest abstraction level and defines only the basic conditions and participants of a scenario; on top of a functional scenario, a logical scenario is defined by a set of parameters and their ranges; with the parameter values fixed in a logical scenario, a concrete scenario is generated. In the context of scenario generation, search-based testing usually consists in searching in the parameter space of a logical scenario for a concrete scenario, with specific objectives. Below are some examples of applying search-based testing to generate concrete scenarios for the testing of the planning module.
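To illustrate the workflow at the code level, the sketch below identifies a hypothetical logical cut-in scenario by a few parameter ranges and runs a simple random search for a concrete scenario that minimizes a safety objective; the simulate function, the parameter names, and their ranges are placeholders rather than any system surveyed here:

import random

# Parameter ranges of a hypothetical logical cut-in scenario.
PARAM_RANGES = {
    "ego_speed": (10.0, 30.0),       # m/s
    "npc_speed": (5.0, 25.0),        # m/s
    "cut_in_distance": (5.0, 40.0),  # m, gap when the NPC starts to cut in
}

def simulate(params):
    # Placeholder for executing the planner in a simulator; returns the minimum
    # distance (m) between the ego vehicle and the NPC (smaller = more critical).
    return abs(params["cut_in_distance"] - 0.5 * (params["ego_speed"] - params["npc_speed"]))

def random_search(budget=200, unsafe_threshold=1.0):
    best, best_fitness = None, float("inf")
    for _ in range(budget):
        concrete = {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}
        fitness = simulate(concrete)  # search objective: minimize the safety margin
        if fitness < best_fitness:
            best, best_fitness = concrete, fitness
        if best_fitness < unsafe_threshold:
            break  # a critical concrete scenario has been found
    return best, best_fitness

print(random_search())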
References [137, 138, 139, 141] use a dedicated path planning system from their industry collaborator that computes the trajectories of the ADS based on several constraints, e.g., safety and traffic regulations. The aggressiveness of the path planning strategy is decided by a system parameter, named weight. Laurent et al. [137] define a coverage criterion named weight coverage, which is used to characterize the testing adequacy of the weight parameter. In Reference [138], they propose two search-based techniques, named the single-weight approach and the multi-weight approach, that automatically generate testing scenarios guided by weight coverage. Specifically, the single-weight approach searches for scenarios that cover one specific weight of the path planner, while the multi-weight approach generates scenarios that cover different weights simultaneously using multi-objective search. Arcaini et al. [139] consider searching for the driving patterns that are identified by the features appearing in the planned trajectory, such as longitudinal/lateral acceleration and curvature. The driving patterns that take place in a trajectory for a considerable duration are relevant to the characteristics of the path planner and thus facilitate engineers in system assessment. Since the testing scenarios in their previous works contain numerous irrelevant elements and are thus hard to debug, in their latest work [141], they target the simplification of testing scenarios: all the irrelevant traffic participants are removed while the failures can still be triggered.
Althoff et al. [142] propose the notion of drivable area for motion planning algorithms, which represents a safe solution space in which the ADS can avoid collision. Then they adopt a search method to generate scenarios that are highly critical in the sense that the drivable area is limited. In their follow-up work [143], the authors consider the interference of other traffic participants in the drivable area to increase the complexity of the scenarios. In their experiment, evolutionary algorithms [147] are demonstrated to be advantageous in finding a local optimum over these complex and diverse scenarios. Bak et al. [144] apply rapidly exploring random trees (RRT) to search for adversarial agent perturbations, i.e., slight modifications to the behaviors of other vehicles. Kahn et al. [145] generate occlusion scenarios for testing the behavior planning module of an ADS. To be specific, they apply an occlusion-guided search method to inject vehicles into scenarios extracted from naturalistic data. Experimental results show that the number of occlusion-caused collisions generated by their approach is 40 times higher than that from the naturalistic data.

5.3.2 Test Oracle.

Assessing the correctness of the output of the planning module, i.e., a planned trajectory, is a challenging problem, due to the lack of an oracle that represents the “correct” trajectory. In this section, we introduce the studies [148, 149] that define different metrics as the oracles to evaluate the correct functionality of the planning module.
Calò et al. [148, 149] define the notion of avoidable collisions to distinguish them from unavoidable collisions, in a dedicated path planner. By their definition, a collision is avoidable if it can be avoided from happening in the same scenario by using a different system configuration of the ADS. Compared to the unavoidable ones, the avoidable collisions are considered critical, since these collisions require system reengineering to rectify the unsafe behavior.

5.3.3 Test Adequacy.

As mentioned in Section 5.3.1, the inputs to the planning module involve both external parameters that identify a scenario and internal parameters of the ADS. Because there are infinitely many possible combinations of these parameters, generating test cases that are sufficiently diverse remains a great challenge. In this section, we collect the studies [137, 140] that propose coverage measurements on the space of the parameters. Namely, the weight coverage criterion [137] refers to the coverage of the possible configurations of the path planner under test; the route coverage criterion [140] is proposed to measure whether different features of a map have been explored by the test suite.
Laurent et al. [137] propose a coverage criterion, named weight coverage, to test a dedicated path planning system. In their path planner, there is a weight function that consists of six weight parameters, which affect the path planning decisions from different aspects, such as safety and comfort. To cover diverse planning decisions made by the system, the authors use the weight coverage to guide the exploration of the weight parameter space. Thereby, they manage to generate scenarios that cover more diverse combinations of the weight parameters.
Tang et al. [140] propose another coverage criterion called route coverage for testing the route planning functionality of Apollo. Based on a Petri net abstracted from the map, they quantify the route diversity based on the junction topology feature and route feature. The junction topology feature describes the relative position and connection relationship of the roads at a junction, while the route feature describes the action of Apollo to track a selected road. By mutating the test cases, they achieve a high route coverage ratio and thus obtain a diverse test suite that covers various features of the map.

5.3.4 Discussion.

The summary of the collected papers for testing the planning module is shown in Table 5. We find that search-based testing is a dominant technique that has been demonstrated to be effective in revealing faults in the planning module [137, 138, 139, 141, 142, 143, 144, 145]. In addition, several metrics are proposed for facilitating the testing on the planning module, e.g., avoidable collision [148] for tackling the oracle problem and weight coverage [137] and route coverage [140] for evaluating the sufficiency of the test suite. However, most of these metrics are dedicated to specific path planning systems, and it needs to be further explored whether they could be generalized to the planning modules of other systems.
Table 5.
Table 5. Summary of the Papers for the Planning Module Testing
Summary: Search-based testing is an effective technique for revealing faults in the planning module. Several metrics have been proposed for testing particular path planning systems, and it needs further exploration on how to generalize these metrics to other systems.

5.4 Control Module

Based on the trajectories produced by the planning module, the control module takes charge of the lateral and longitudinal control of the ADS. By using various control algorithms, such as MPC and PID control, it generates control signals, e.g., acceleration, deceleration, and steering angle, to the CAN bus for the control of the whole system. In this module, we introduce the works on the testing of the control module, from the perspectives of test methodology (shown in Section 5.4.1) and test oracle (shown in Section 5.4.2).

5.4.1 Test Methodology.

The control module takes charge of multiple functionalities, such as the longitudinal control and the lateral control of the ADS. Hence, the testing of this module focuses on detecting vulnerabilities in the control mechanisms.
Fault injection. Fault injection is a method that deliberately introduces faults into a system to assess the fault tolerance of the system. Uriagereka et al. [151] adopt this technique for testing the fault tolerance ability of the control module of an ADS. Specifically, they inject faulty GPS signals into the lateral control function of the ADS, which makes it produce wrong steering commands. By calculating the fault tolerant time interval, which denotes the duration from the activation of the fault to the occurrence of unsafe behavior, they find the lateral control system can tolerate this type of fault for as long as 177 ms. Zhou et al. [152] inject faulty control signals through CAN bus to cause collisions without being detected by ADS safety mechanisms, e.g., forward collision warning. They evaluate the method with OpenPilot and find that the lateral control of the system is the typically vulnerable part with a high attack success rate.
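The following toy sketch conveys the idea with a purely hypothetical proportional lateral controller and a simplistic vehicle model: a constant bias is injected into the lateral-position measurement at a chosen time, and the code measures how long it takes before the true lane deviation exceeds a safety bound, a crude analogue of the fault tolerant time interval:

def lateral_controller(measured_offset, gain=0.5):
    # Hypothetical proportional steering controller (rad).
    return -gain * measured_offset

def run_with_fault(fault_start=2.0, fault_bias=1.5, dt=0.01, horizon=10.0, bound=0.8):
    # Inject a constant bias into the position measurement and return the time from
    # fault activation until the true lateral offset exceeds `bound` (m).
    offset, speed, t = 0.0, 10.0, 0.0  # true lateral offset (m), speed (m/s), time (s)
    while t < horizon:
        measurement = offset + (fault_bias if t >= fault_start else 0.0)
        steering = lateral_controller(measurement)
        offset += speed * steering * dt  # simplistic kinematic update
        if t >= fault_start and abs(offset) > bound:
            return t - fault_start       # fault tolerant time interval
        t += dt
    return None  # the fault is tolerated over the whole horizon

print(run_with_fault())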
Sampling. We refer to sampling as the statistical method that samples values from a probability distribution. Wang et al. [153] sample the relevant parameters, e.g., speeds of the non-player characters (NPCs) and the ego vehicle, for generating scenarios of different challenge levels. The method is evaluated on the unprotected left-turn scenario and experimental results demonstrate the robustness of the MPC controller. In their later work [154], game theory is applied to characterize the interactive behaviors of NPCs in highway merging scenarios.
Falsification. Temporal logic-based falsification [155, 156, 157, 158, 159, 160, 161] is applied to ADS testing in References [162, 163]. Originally, falsification refers to a technique for testing general cyber-physical systems, guided by the quantitative semantics of temporal logic specifications, which indicates how far the system is from being unsafe. Tuncali et al. [162] propose a falsification-based automatic test generation framework for testing collision avoidance controllers. They utilize a cost function, i.e., the quantitative semantics of the temporal logic specification, as guidance in searching for the critical scenarios in which the relative speed of the two vehicles in the collision is minimal. The obtained scenarios can be taken as the behavioral boundary that divides the safe and unsafe behaviors. As a follow-up work, Tuncali et al. [163] utilize the RRT algorithm for ADS falsification. They incorporate a new cost function that applies time-to-collision to measure the seriousness of the collision. As a result, the new method achieves better effectiveness in searching for safety-critical scenarios, thanks to the exploration brought by the RRT algorithm.
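In spirit, falsification minimizes the robustness value of a specification such as “always keep a safe distance”; once the value drops below zero, the specification is violated and the corresponding input is a counterexample. A minimal sketch with a hypothetical one-parameter scenario and a toy distance trace:

import numpy as np

def distance_trace(npc_decel):
    # Hypothetical simulation: gap (m) to a lead NPC braking with `npc_decel` (m/s^2).
    t = np.linspace(0.0, 3.0, 301)
    return 15.0 - 0.5 * npc_decel * t ** 2

def robustness_always_safe(trace, safe_distance=2.0):
    # Quantitative semantics of "always (distance > safe_distance)":
    # the worst (minimum) margin over the trace; negative means violation.
    return float(np.min(trace - safe_distance))

def falsify(candidates):
    best_input, best_rob = None, float("inf")
    for x in candidates:
        rob = robustness_always_safe(distance_trace(x))
        if rob < best_rob:
            best_input, best_rob = x, rob
        if best_rob < 0.0:
            break  # the specification has been falsified
    return best_input, best_rob

print(falsify(np.linspace(0.0, 4.0, 41)))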

5.4.2 Test Oracle.

Like the planning module, the control module also faces the oracle problem in its testing—indeed, it is usually not straightforward to determine whether a control decision is “correct.” Djoudi et al. [164] propose a framework to determine whether the control module makes “the correct decision.” They design a model to generate an oracle area in the given scenario, which is the closest safe position ahead of the vehicle. A control decision is then considered as “the correct decision” if it could drive the vehicle close to the oracle area.

5.4.3 Discussion.

Table 6 summarizes the collected papers for the testing of the control module, where the number of studies is relatively small. Note that currently most control modules of ADS directly adopt mature control techniques, such as PID [13] and MPC [14], which partially explains why this module has not been extensively studied. The collected studies adopt three major techniques for testing the control module, including fault injection [151, 152], sampling [153, 154], and falsification [162, 163]. To tackle the oracle problem in control module testing, the framework proposed in Reference [164] generates an oracle area for judging whether a control decision is “correct.” However, we find little work that handles the test adequacy problem for this module. In general, since the control module deals with continuous dynamics, defining adequacy criteria for its test cases is challenging and requires further exploration.
Table 6.
Table 6. Summary of the Papers for the Control Module Testing
Summary: Since most control modules of ADS adopt mature control techniques, e.g., MPC and PID, there are not many works studying the testing of the control module. Existing techniques mainly include fault injection, sampling, and falsification. Some works study the oracle problem of the control module, but there is little work on adequacy criteria for testing it.

5.5 End-to-End Module

The end-to-end module is a special design adopted by many modern ADS, which integrates the functionalities of perception, planning, and control in a single DNN-based model. The DNN model is often developed by supervised learning, trained on a dataset of realistic driving data. Each element of the dataset is a pair \(\langle I, c\rangle\) , which maps the information I at the sensor end to a label c that indicates the desired control decision at the controller end. After training, the model can infer control decisions based on the driving environment at runtime to drive the ADS properly. For instance, in some modern ADS that perform steering angle control, the end-to-end DNN model takes as input the sensing information, including road conditions and the status of other cars, and outputs a series of predicted steering angles for controlling the ADS. In this section, we introduce the collected studies on the testing of the end-to-end module from the perspectives of test methodology (shown in Section 5.5.1), test oracle (shown in Section 5.5.2), and test adequacy (shown in Section 5.5.3).

5.5.1 Test Methodology.

As mentioned before, an end-to-end DNN model integrates three functionalities, namely perception, planning, and control, in a single module. Among these three functionalities, perception is the most vital part as it provides input information to other modules; meanwhile, it is also the most vulnerable to external environments, as it essentially involves image recognition tasks that rely on deep learning. Compared to a DNN just for perception, although an end-to-end DNN does not directly output the perception information, the control decisions it makes still depend on the perception information. Therefore, like the case in the perception module, generating adversarial images or scenarios that fool the end-to-end DNN is still the major testing methodology for testing the end-to-end modules.
We introduce three approaches, namely search-based testing, optimization-based adversarial attack, and GAN-based attack. The first approach has been introduced in Section 5.3.1; the last two approaches have been introduced in Section 5.2.1.
Search-based testing. Search-based testing searches for a target test case in the input space, guided by certain objectives. One commonly used objective is the coverage of the test suite: maximizing the cumulative coverage of a test suite can expose more diverse behavior of the system and thus allows a better chance of detecting the target test case. In the context of DNN testing, neuron coverage is proposed by Pei et al. [107] as an analogue of the structural coverage in traditional programs. In their follow-up work, Tian et al. [165] propose a coverage-guided testing framework called DeepTest for DNN testing. They propose various image operations, e.g., scaling, shearing, and rotating, as the test input (image) mutation methods; then they generate test cases by applying these operations to seed images and keep only those mutants that enlarge the cumulative neuron coverage of the test suite. Experiments are conducted on three end-to-end models, and the results show the effectiveness of their method in test case generation.
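The core loop of such coverage-guided generation can be sketched as follows; the image transformations, the model, and the coverage function are placeholders (the neuron-coverage sketch from Section 5.2.3 could be plugged in), and this is not the actual DeepTest implementation:

import random

TRANSFORMATIONS = ["scale", "shear", "rotate", "brightness", "blur"]

def mutate(image, op):
    # Placeholder for a label-preserving image transformation.
    return f"{image}+{op}"

def coverage_of(suite, model):
    # Placeholder for a cumulative coverage metric (e.g., neuron coverage) of the suite.
    return len({img.split("+")[-1] for img in suite if "+" in img}) / len(TRANSFORMATIONS)

def coverage_guided_generation(seeds, model, budget=50):
    suite = list(seeds)
    best_cov = coverage_of(suite, model)
    for _ in range(budget):
        candidate = mutate(random.choice(suite), random.choice(TRANSFORMATIONS))
        new_cov = coverage_of(suite + [candidate], model)
        if new_cov > best_cov:  # keep only mutants that enlarge the cumulative coverage
            suite.append(candidate)
            best_cov = new_cov
    return suite, best_cov

print(coverage_guided_generation(seeds=["img0", "img1"], model=None))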
In addition to coverage, the seriousness of the unsafe behavior is another factor that can be used as the search objective, and this has been considered by Li et al. [166]. In their work, the seriousness of the unsafe behavior of the end-to-end module is formulated as the deviation of the actual steering angle made in the test scenario from the expected steering angle. The authors design an objective function that takes into account both the coverage and the seriousness, such that they can detect not only diverse but also serious unsafe test cases.
Optimization-based attack. The optimization-based adversarial attacking framework has been introduced in Section 5.2.1. Zhou et al. [167] introduce a framework called DeepBillboard that generates adversarial perturbations added to billboards. The perturbations they generate can mislead the steering angles in a series of frames captured by camera sensors during the driving process, despite varying physical conditions, such as different distances and angles to the billboard. Later, Pavlitskaya et al. [168] extend DeepBillboard with the projected gradient sign method [169], and experimental results show that curved and rainy scenes are more vulnerable to these adversarial attacks. In another line of work, adversarial black lines are utilized to attack end-to-end driving models [170, 171]. These black lines are easy to paint on public roads and can lead to a deviation of an ADS from its original path.
GAN-based attack. GAN has been introduced in Section 5.2.1, and it has been considered as a major approach for adversarial attacking. Kong et al. [172] propose a GAN-based approach called PhysGAN, which utilizes 3D tensors, i.e., a slice of video containing hundreds of frames, to generate adversarial roadside signs that can continuously mislead the end-to-end driving models with high efficacy and robustness. In another work, to generate realistic adversarial images, Zhang et al. [173] propose a GAN-based approach called DeepRoad. They demonstrate that their generated adversarial images are realistic under various weather conditions and effective in detecting unsafe system behaviors.

5.5.2 Test Oracle.

An oracle of the end-to-end module indicates which is the correct control decision at each moment of a scenario. Although this can be done with the help of human drivers, it is too expensive and prone to errors. Existing works propose various automatic methods to solve the oracle problem of the end-to-end module, including metamorphic testing [165, 173, 174], differential testing [107], and model-based oracle [175, 176].
Metamorphic testing. As introduced in Section 5.2.2, metamorphic testing is a viable way to solve the oracle problem. In the testing of end-to-end models, a few works leverage metamorphic relations to define the test oracles, e.g., DeepTest [165] and DeepRoad [173]. The metamorphic relation introduced by DeepTest [165] is that the steering angle should not change significantly for the same scene under different weather and lighting conditions. Similarly, DeepRoad [173] checks model consistency, which means that, for a synthetic image and the original image, the difference between the two predicted steering angles should be smaller than a threshold. Pan et al. [174] introduce a metamorphic relation for testing end-to-end models in a foggy environment; the relation requires that the density and direction of fog should not affect the output steering angle of the target models.
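A sketch of such a metamorphic oracle, with a placeholder end-to-end model and a placeholder weather transformation (the threshold and all names are illustrative), could look as follows:

def metamorphic_oracle(model, image, transform, max_deviation=0.05):
    # Flag a violation if the predicted steering angle (rad) changes by more than
    # `max_deviation` after a semantics-preserving transformation (e.g., adding fog).
    original = model(image)
    transformed = model(transform(image))
    return abs(original - transformed) <= max_deviation

toy_model = lambda image: 0.02 * sum(image) / len(image)        # stand-in for a DNN
add_fog = lambda image: [min(1.0, p + 0.3) for p in image]      # stand-in transformation
print(metamorphic_oracle(toy_model, [0.1, 0.2, 0.3], add_fog))  # True: relation holds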
Differential testing. Pei et al. [107] apply differential testing to generate scenarios that reveal the inconsistencies between different DNN models. For the same scenario, they expect that the DNNs under test should give the same inference result. The violation of this property is considered as an unexpected behavior.
Model-based oracle. Stocco et al. [175] propose a so-called self-assessment oracle for the potential risk prediction of ADS. The self-assessment oracle involves training a probabilistic model that characterizes the distribution of the potential risks under various real scenarios. This model can be used to monitor the real environment during the execution of the ADS and predict situations that are probably not handled by the ADS. This novel idea is also studied by Hussain et al. [176].

5.5.3 Test Adequacy.

Combinatorial coverage is also adopted in end-to-end module testing, e.g., the two-way combinatorial testing based on image transformations [177]. In Section 5.2.3, we introduced the structural coverage for DNN testing, which analogizes the structural coverage in traditional program testing. Since the end-to-end module also relies on DNN models, these structural coverage criteria are also used in the testing of the end-to-end module. Neuron coverage, which has been introduced in Section 5.2.3, is used by its authors for coverage-guided testing [165], as mentioned in Section 5.5.1. Refined structural coverage criteria for DNNs, such as k-multisection neuron coverage and neuron boundary coverage [108, 166], have also been adopted, as elaborated in Section 5.5.1.

5.5.4 Discussion.

Similar to the perception module, the end-to-end module also contains many DNN-based models; however, these models are used not only for perception but also for the control of the vehicle. As shown in Table 7, adversarial attack methods used in perception module testing, including optimization-based methods [167, 168, 170, 171] and GAN-based methods [172, 173], are also adopted as testing methodologies for this module. One observation is that, compared to perception module testing that tests DNN models using single images, the works [167, 172] for end-to-end module testing often use a series of images, i.e., the frames captured by cameras in a system execution. Another major testing approach is coverage-based testing [107, 165, 166, 173], in which the testing is guided by coverage criteria proposed for measuring whether the system behavior has been sufficiently explored.
Table 7.
Table 7. Summary of the Papers for the End-to-end Module Testing
Table 8. Commonly Used Safety Metrics
Category | Name | Description
Temporal metrics | TTC [163] | The time until two objects collide at the current speed and path
| WTTC [241] | The time of the collision in the most likely accident scenario
| MPrISM [242] | An estimate of the TTC that considers the game interaction between vehicles
| THW [243] | The time between two objects reaching the same location
| TTR [243] | The remaining time until the start of the last driving maneuver that can avoid collisions with all objects in the scenario
Non-temporal metrics | SD [196] | The stopping distance of the vehicle at the maximum comfortable deceleration
| LP [191] | The distance between the vehicle center and the lane center
| DRAC [244] | The minimum deceleration rate required by a vehicle to avoid a crash
| SARR [245] | The number of steering angle reversals larger than a certain value
Since it is hard to evaluate the correctness of the output steering angle for an input image, metamorphic testing [165, 173, 174] and differential testing [107] are adopted for tackling this problem. In addition, we find that other oracle techniques, e.g., model-based oracles [175, 176], can be used to solve the oracle problem for this module.
Summary: Many of the testing techniques used for testing the perception module can also be used for testing the end-to-end module, such as adversarial attack. One notable difference from the perception module is that, in the end-to-end module, these techniques are applied in a driving context involving a series of continuously changing images, rather than a single image. Besides the techniques that have been used in the perception module, e.g., metamorphic testing, new techniques, e.g., differential testing, are employed to solve the oracle problem. In terms of test adequacy metrics, this module is very similar to the perception module.

5.6 Answer to RQ1

In total, we survey over 80 papers that study the testing of different modules of ADS. Various testing techniques have been proposed for testing different modules. Based on our survey, we can draw the following conclusions: (1) for the sensing module, physical testing and deliberate attack on the sensors could effectively find their abnormal behaviors; (2) for the perception module and the end-to-end module, adversarial attack is the most widely used approach, since the two modules mainly rely on the use of DNN-based models; (3) for the planning module, though the relevant studies are not so many, search-based testing has been extensively adopted; and (4) for the control module, main testing techniques include fault injection, sampling, and falsification.
Despite the numerous techniques dedicated to different modules, we also find some open challenges for the testing of these modules. For example, the neuron coverage in Reference [107] may not be effective for testing the perception module and the end-to-end module. More details about these open challenges are also discussed in Section 8.

6 Literature of Techniques On System-level ADS Testing

In this section, we introduce the research works on system-level ADS testing with the goal of answering RQ2 in Section 1. Different from module-level testing, system-level testing focuses on the failures that threaten the safety of the whole vehicle due to the collaborations between modules. In Section 5, most of the testing works are done in simulated environments, implemented by various software simulators. In this section, we introduce not only simulation-based testing in Section 6.1 but also mixed-reality testing in Section 6.2, which brings real hardware into the testing loop.

6.1 System-level Testing with Simulators

We first introduce system-level testing conducted with the help of software simulators. Similarly to the module-level works, we also present these studies from three perspectives, namely test methodology (shown in Section 6.1.1), test oracle (shown in Section 6.1.2), and test adequacy (shown in Section 6.1.3).

6.1.1 Test Methodology.

In the literature, we find various testing methods for the system-level testing of ADS, including search-based testing, adaptive stress testing, sampling-based methods, and adversarial attack. In this section, we introduce these testing methods.
Search-based testing. Search-based testing (or a similar concept named fuzzing) is one of the most widely adopted methodologies in ADS testing. As introduced in Section 5.3.1, it consists in searching in the parameter space for specific parameter values that achieve a testing objective. In this section, we introduce the works [187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205] to illustrate the ideas.
Dreossi et al. [187] propose a compositional search-based testing framework and apply it for the testing of ADS with machine learning components (i.e., mostly perception). The basic idea in their work is the cooperative use of the perception input space and the whole system input space: The constraints on one space can reduce the search efforts in the other space. In this way, they improve the efficiency of searching for counterexamples. Abdessalem et al. [188] propose a multi-objective search algorithm for detecting errors caused by feature interaction. A feature interaction describes the interaction between different ADS functionalities, e.g., an AEB command could be overridden by an ACC command, since the two functionalities both control the braking actuator. In practice, search-based testing has also proved to be effective for industry-level ADS. Li et al. [196] propose AV-Fuzzer used for testing of Apollo, and they show the effectiveness of this framework in finding dangerous scenarios.
There is a line of work [190, 191, 192] that studies the relationship between test input and system behavior. Riccio et al. [190] propose the notion of frontier of behaviors, which represents the boundary of inputs where the system starts to behave abnormally. In their later work [191], they first provide an interpretable feature map that explains the correlations between test inputs and system behaviors and leverage Illumination Search [206] to explore the feature space. This approach is enhanced in their follow-up work [192] for finding those test inputs that contribute to the exploration of the feature map.
Lane keeping systems are an important target in ADS testing. When search-based testing is applied, different road representations can affect the effectiveness of the approach. Castellano et al. [193] compare six road representations for testing lane keeping systems and find that curvature and orientation are essential factors that affect the behaviors of these systems. Gambi et al. [189] propose a novel approach called ASFAULT to generate virtual roads for testing lane keeping systems. Experiments on BeamNG.AI [27] and DeepDriving [207] demonstrate that the proposed approach can generate effective testing roads that cause vehicles to deviate from the correct lane. Open-source search-based tools, e.g., Frenetic [208], have also been developed, and the comparison of these tools is reported in References [26, 209].
There are other works that aim to improve the search efficiency by designing better search algorithms. Abdessalem et al. [188, 210] combine multi-objective search with decision tree classification for test generation of ADS. In their framework, the classification checks whether a scenario is a critical one and thus accelerates the search process. Goss et al. [194] apply RRT based on an Eagle Strategy to estimate the critical scenario boundaries. Zheng et al. [195] propose a quantum genetic algorithm that allows a lower population size. Luo et al. [199] study test case prioritization techniques and employ multi-objective search algorithms to find violations with a higher probability of occurrence. Test case prioritization techniques are also utilized to accelerate the regression testing of ADS [200, 201] and achieve remarkable results.
Search-based testing is usually based on system simulations; however, even with software simulators, the simulations of ADS can still be expensive and slow. There is another line of work [202, 203, 204, 205] that trains surrogate models as substitutes to accelerate testing. Abdessalem et al. [202] train a surrogate model that maps the scenario parameters to fitness functions and use the surrogate to detect non-critical parameters for search space reduction. A Gaussian Process is also leveraged for training a surrogate model in Reference [203]. To search for more collision scenarios, Beglerovic et al. [204] train a surrogate model by utilizing the critical scenarios that already exist. To find an optimal surrogate model for ADS testing, Sun et al. [205] compare six types of surrogate models, e.g., Extreme Gradient Boosting and Kriging surrogates, in two logical scenarios.
Adaptive stress testing. Stress testing, which provides test cases beyond the capability of the system under test, has been widely adopted in various industrial domains. Adaptive stress testing, literally, performs stress testing in an adaptive manner; namely, it prioritizes the test cases and allocates different testing resources to them accordingly. Therefore, specifying the policy of priority assignment is the key to adaptive stress testing. Koren et al. [211] apply adaptive stress testing to ADS and design a priority assignment policy based on the difference between the expected behavior and the actual behavior. In a later work [212], they propose a new priority assignment policy based on Responsibility-Sensitive Safety (RSS) [213], which defines the utopian behavior of cars under which no collision will happen in a scenario. The new policy is thus defined according to the distance of the ADS behavior from the utopian cases in the RSS rules [212]. Baumann et al. [214] adopt reinforcement learning, namely Q-learning [215], for exposing more critical scenarios in the overtaking scenario. Reinforcement learning is also combined with RSS rules for generating edge cases in Reference [216].
Sampling. One use case in ADS testing is to generate scenarios by sampling from a natural scenario distribution so as to make the generated scenarios realistic; this has been studied in Reference [217]. Nitsche et al. [218] propose a sampling-based framework for validating ADS at road junctions. Specifically, they first cluster the junction scenarios along with the representative variations from real-world accident data; then the relevant parameters are sampled by the Latin Hypercube Sampling method and used to compose concrete scenarios for simulation testing.
Sampling is also used to help the identification of failure features. Corso et al. [219] combine signal temporal logic (STL) with a sampling method to generate disturbance trajectories for testing. Those trajectories are interpretable and easier to debug thanks to the features of STL, i.e., the description of logical relationships over time. In another work of theirs [220], dynamic programming is applied during the sampling process to discover more failure scenarios.
Batsch et al. [221] sample the simulation data in a traffic jam scenario with the CarMaker [222] simulation platform. The obtained datasets are then used to train a Gaussian Process Classification model, which could probabilistically estimate the boundary between safe scenarios and unsafe scenarios. Schütt et al. [223] utilize Bayesian optimization and Gaussian process to identify the relevant parameters of a logical scenario, i.e., they find the vehicle speed has no influence in one vulnerable road user testing scenario. Birkemeyer et al. [224] leverage a Feature Model, i.e., features are organized as nodes in a tree structure, to represent a scenario space for sampling. Experimental results show that the FM-based sampling method is suitable for scenario selection for ADS testing.
Moreover, advanced sampling techniques can be applied to achieve specific goals; for example, importance sampling [225] is a technique used to sample rare events. Under normal conditions, unsafe scenarios are rare, so detecting them is hard and costly. In that case, importance sampling can be applied to accelerate the testing [226, 227]. Zhao et al. [228, 229, 230, 231, 232, 233, 234] work extensively in this direction. The main aim of their work is to detect more system failures with fewer simulations, under various scenarios. Specifically, in References [228, 229, 230, 232, 233], they investigate the cut-in/lane change scenarios; in Reference [231] and Reference [234], they focus on the car-following scenario and the unprotected pedestrian crossing scenario, respectively.
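The following standard textbook-style sketch (unrelated to the specific estimators used in the cited works) illustrates why importance sampling helps with rare events: the probability of an event that plain Monte Carlo would almost never observe is estimated by sampling from a shifted proposal distribution and reweighting with the likelihood ratio:

import numpy as np

def estimate_rare_event_probability(n=100_000, seed=0):
    # Estimate P(X > 4) for X ~ N(0, 1) by sampling from the proposal N(4, 1)
    # and reweighting each sample with the likelihood ratio p(x)/q(x).
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=4.0, scale=1.0, size=n)
    log_ratio = -0.5 * x ** 2 + 0.5 * (x - 4.0) ** 2  # log p(x) - log q(x)
    return float(np.mean((x > 4.0) * np.exp(log_ratio)))

# True value is 1 - Phi(4), roughly 3.2e-5; naive sampling would need millions of runs.
print(estimate_rare_event_probability())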
Adversarial attack. Adversarial attack has been introduced in Section 5.2.1, in which it is used for testing the perception module. Here we introduce several works [235, 236, 237, 238] that also attack the perception module, but they assess the influence of the attack on the whole system. Sato et al. [235] generate attack patches, as a camouflage for dirty roads, that mislead the lateral control functionality of the victim ADS to deviate from the lane. Rubaiyat et al. [236] generate perturbations to camera-captured images, based on a system-level safety risk analysis, to assess the reliability of OpenPilot under real-world environmental conditions. Nassi et al. [237] leverage the print advertisement to perform the attack, e.g., they embed an adversarial traffic sign on the back of other vehicles and mislead the system to wrong behaviors. Wang et al. [238] perform an attack that adds perturbations to the trajectories of NPCs and modifies the corresponding LiDAR sensor data.

6.1.2 Test Oracle.

The oracles of the system-level testing of ADS are usually defined by safety metrics, such as time-to-collision, which measures how far the ADS under test is from dangerous situations. These metrics can be directly computed by monitoring the system behavior in the simulator or expressed as formal specifications, such as STL, which can automatically monitor the system behavior and compute the metric values. Besides, metamorphic relations are also used in some works for defining the oracle of ADS.
Safety metrics. In system-level testing, a suitable safety metric, also called a criticality metric, can be leveraged to find more system violations. There have been studies [9, 239, 240] that comprehensively investigate these safety metrics, and some commonly used metrics are listed in Table 8. These safety metrics can be categorized into temporal metrics and non-temporal metrics. Temporal metrics describe the temporal requirements on moving objects, and the most popular ones are Time-to-Collision (TTC) [163] and its extensions, e.g., Worst-Time-to-Collision (WTTC) [241], which measure the closeness of the ego car to collision in the scenario. Weng et al. [242] propose the Model Predictive Instantaneous Safety Metric (MPrISM), which considers the interaction between moving vehicles. Other metrics include Time Headway (THW) and Time-to-React (TTR) [243]. The former calculates the time for the ego vehicle to reach the position of the lead vehicle, and the latter estimates the remaining time for a required reaction, e.g., a braking action.
Non-temporal metrics concern different aspects, such as distance, deceleration, and steering. One distance metric called Stop Distance (SD) [196] calculates the distance for a vehicle to stop with a maximum comfortable deceleration. Another distance metric is called Lateral Position (LP) [191], which defines the distance between the center of the vehicle and the center of the driving lane. Deceleration metrics, such as Deceleration Rate to Avoid a Crash (DRAC) [244], consider the deceleration rate during emergency. Steering metrics, such as Steering Angle Reversal Rate (SARR) [245], focus on the steering angle of a vehicle during the driving process.
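For illustration, TTC and THW in a simple one-dimensional car-following situation can be computed as below; real implementations operate on full trajectories and handle many more edge cases:

def time_to_collision(gap, ego_speed, lead_speed):
    # TTC (s): time until the ego vehicle reaches the lead vehicle at current speeds.
    closing_speed = ego_speed - lead_speed
    return gap / closing_speed if closing_speed > 0 else None  # None: not closing in

def time_headway(gap, ego_speed):
    # THW (s): time for the ego vehicle to reach the lead vehicle's current position.
    return gap / ego_speed if ego_speed > 0 else None

# Ego at 20 m/s, lead at 15 m/s, 25 m apart.
print(time_to_collision(25.0, 20.0, 15.0))  # 5.0 s
print(time_headway(25.0, 20.0))             # 1.25 s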
There are works [197, 198] that propose to organize and utilize these safety metrics in an elegant manner. Also, Li et al. [246] propose to design metrics that involve more factors such as the relationship between scenarios, tasks, and functionalities of an ADS.
Formal specifications. As introduced in Section 5.2.2, formal specification uses temporal logic languages to express the properties that the system should satisfy during execution; then, by specification-based monitoring, the satisfaction of the system behavior can be automatically decided. For system-level testing of ADS, STL, which can express properties over real-valued continuous signals, is a suitable specification language. A few works adopt STL as the specification language [187, 247, 248], in which STL monitors are synthesized to decide whether the behavior of the ADS satisfies the desired safety properties. Zhang et al. [249] utilize formal specifications to represent driving rules and ADS behaviors and to check the consistency between them.
Metamorphic testing. Metamorphic testing has been discussed for module-level testing in Sections 5.2.2 and 5.5.2. At the system level, Han et al. [250] utilize metamorphic relations to distinguish between real failures and false alarms. The metamorphic relation requires that the behavior of the ADS should be similar in slightly different scenarios; otherwise, the collision in one of such scenarios is considered avoidable and thus a real failure.

6.1.3 Test Adequacy.

In system-level testing, the adequacy of testing is embodied by the diversity of the testing scenarios for the ADS. In this section, we introduce two lines of work that define various metrics to characterize the diversity of scenarios.
Scenario coverage. There is a line of work that defines coverage for scenarios. The intuition is that the testing is sufficient if all different types of scenarios are covered [251]. Tang et al. [252] classify the scenarios based on the topological structure of the map. Kerber et al. [253] define a distance measure over scenarios based on their spatiotemporal features, which enable scenario clustering. Besides, the temporal, spatial, and causal information of the simulation data can be further abstracted into situations for covering more test scenarios [254, 255].
Combinatorial coverage. Combinatorial coverage has been introduced in Section 5.2.3. Unlike the above coverage criteria defined directly on the features of the scenarios, combinatorial coverage considers the coverage of the combinations of different parameters that identify different scenarios. Tuncali et al. [247, 248] propose to use covering arrays for scenario generation in ADS testing. A covering array is a specific mechanism in software testing that guarantees the satisfaction of the t-way combination coverage of the parameters; see Section 5.2.3 for more details about t-way combination coverage. Guo et al. [256] propose a definition of scenario complexity and apply combinatorial testing techniques to generate more complex testing scenarios. Shu et al. [257] adopt the three-way combinatorial testing method on lane-changing scenarios, which ensures a high coverage of the generated critical scenarios. Li et al. [258] utilize the ontology concept, i.e., formulations of entities and their relationships, to describe the driving environment of an ADS. Then the constructed ontologies are combined with combinatorial testing techniques for generating concrete scenarios with coverage guarantee. Another work [259] proposes a scenario generation framework called ComOpT, based on t-way combinatorial testing, and finds numerous system failures of Apollo. Moreover, combinatorial testing is also used in Reference [260] to tackle the regression testing problem of ADS.

6.1.4 Discussion.

As shown in Table 9, search-based testing [187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205] is the most widely used technique for testing the whole ADS, with different focuses, e.g., studying the relations between test input and system behavior [190, 191, 192], testing lane keeping systems [26, 189, 193, 209], and test case prioritization [199, 200, 201]. Although simulation-based testing aims to solve the high-cost problem of real-world testing, it may repeatedly simulate the same type of scenarios, which is also a time-consuming process. Consequently, adaptive stress testing [211, 212, 214, 216] and sampling-based techniques [217, 218, 219, 221, 223, 224, 228, 229, 230] are applied to accelerate the testing process. As in the cases of the perception and end-to-end modules, adversarial attack [235, 236, 237, 238] has also been adopted for system-level testing, which aims to detect the vulnerabilities of the perception that affect the safety of the whole system. Note that among these testing techniques, adaptive stress testing has not been studied extensively, but it has high potential for future ADS testing, since it is effective in various domains of the industry [269].
Table 9. Summary of the Papers for Simulation-based System-level Testing: Part I
Table 9. Summary of the Papers for Simulation-based System-level Testing: Part II
System-level testing usually relies on safety metrics, e.g., temporal and non-temporal metrics (as shown in Table 8) and metamorphic relations [250], as the oracles that measure the occurrences of safety violations during the testing process. To ensure the adequacy of system-level testing, there are two lines of work, namely scenario coverage [252, 253, 254] and combinatorial testing [247, 248, 256, 257, 258, 259, 260], which propose metrics to characterize the diversity of testing scenarios.
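As a concrete example of how a safety metric can act as a test oracle, the sketch below flags the time steps of a simulated trace at which the time-to-collision (TTC) to a lead vehicle drops below a threshold. It is a simplified illustration under assumed signal names and an assumed threshold; the surveyed works use richer metric suites (cf. Table 8).

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """TTC to the lead vehicle; infinite if the ego vehicle is not closing the gap."""
    closing_speed = ego_speed_mps - lead_speed_mps
    return gap_m / closing_speed if closing_speed > 1e-6 else float("inf")

def ttc_oracle(trace, threshold_s=1.5):
    """Return the time steps at which the TTC safety metric is violated."""
    violations = []
    for t, (gap, v_ego, v_lead) in enumerate(trace):
        if time_to_collision(gap, v_ego, v_lead) < threshold_s:
            violations.append(t)
    return violations

# Toy trace: (gap to lead vehicle [m], ego speed [m/s], lead speed [m/s]) per step.
trace = [(30.0, 20.0, 20.0), (18.0, 22.0, 15.0), (8.0, 22.0, 10.0)]
print(ttc_oracle(trace))  # indices of the steps where TTC < 1.5 s
```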
Overall, there exist more works on system-level testing than on module-level testing. Moreover, there are many other works that study the differences between simulation-based testing and real-world testing. More details can be found in Section 6.3.
Summary: There are a large number of studies that leverage different techniques, e.g., search-based testing, adaptive stress testing, and sampling-based techniques, for testing the ADS at the system level. Besides, numerous metrics have been proposed for different usages in the testing process, e.g., to measure the occurrences of safety violations and to characterize the diversity of testing scenarios.

6.2 Mixed-Reality Testing

Due to the high cost of real-world ADS testing, most approaches in Section 5 and Section 6.1 test ADS in software simulators. Although modern simulators can be powerful and high-fidelity, simulation-based testing is not sufficient to reveal all the problems of ADS, due to the gap between simulators and the real world. As a tradeoff, mixed-reality testing combines simulation-based testing with real-world testing. In this section, we introduce several special testing schemes, which replace certain components in the testing loop with physical ones. Specifically, these schemes include hardware-in-the-loop (HiL), vehicle-in-the-loop (ViL), and scenario-in-the-loop (SciL); their mechanisms are illustrated in Figure 4.
Fig. 4. Illustration of the HiL, ViL, and SciL approaches.

6.2.1 Hardware-in-the-Loop.

HiL testing usually introduces the real ECU hardware into the testing loop, as shown in the green box in Figure 4. There is a line of work [270, 271, 272, 273] that adopts this testing method. Chen et al. [270, 271] propose an HiL testing platform that can simulate multi-agent interaction in large-scale scenarios using OpenStreetMap [274]. Brogle et al. [272] build their HiL platform based on Carla and ROS, which achieves high fidelity in vehicle dynamics and sensor data output. Gao et al. [273] design another HiL platform for AEB testing and find that the performance of the AEB functions in HiL tests is close to that in real road tests.

6.2.2 Vehicle-in-the-Loop.

Different from HiL testing, ViL testing works by integrating a synchronized virtual scenario into a real vehicle, as shown in the gray box in Figure 4. The following works [275, 276, 277, 278, 279] we collected are all based on this idea. Chen et al. [275] propose a ViL testing platform that can reconstruct scenarios based on the corresponding HD map. For simulating more realistic scenarios, these works [276, 277, 278] integrate popular traffic simulators, such as SUMO [280] and VISSIM [281], into the ViL testing loop. Stocco et al. [279] utilize the Donkey Car platform [282] to build a 1:16-scale car that is controlled by end-to-end driving models. They test these driving models in a closed-track environment and study the transferability of failures between simulation and the real world.

6.2.3 Scenario-in-the-Loop.

SciL testing narrows the gap between the simulator and the real world by integrating more real components, such as pedestrian dummies, into the loop, as shown in the blue box in Figure 4. Szalay et al. [283] first propose the concept of SciL testing, and they develop a SciL testing platform based on SUMO and Unity [284] in a later work [285]. Horvath et al. [286] study SciL testing by comparing its implementation process with that of ViL testing. The authors find that the two testing methods share the same basis, but SciL testing is still at an early stage.

6.3 Simulation-based Testing vs. Real-World Testing

Given the efforts in simulation-based testing, a natural question arises: how far is simulation-based testing from real-world testing? Moreover, Kalra et al. [287] find that ADS would need to be driven hundreds of millions of miles to demonstrate their reliability. As an emerging issue, this topic has attracted increasing research attention; here, we introduce the latest progress from two perspectives, namely the realism of test cases and the realism of simulators.
Realism of test cases. One concern with simulation-based testing is that the virtual scenarios generated by testing algorithms, although they lead to system failures, may never happen in the real world. Indeed, simulators can create a wide range of traffic participant behaviors, of which only a subset can actually occur in the real world.
There is a line of work that aims to bridge this gap and thus generate natural scenarios for ADS testing. Nalic et al. [288] propose a co-simulation framework using two simulation tools, CarMaker (for vehicle dynamics) and VISSIM (for traffic simulation); their framework can generate scenarios based on calibrated traffic models derived from real-world data. In their later work [289], a stress testing method, which has been introduced in Section 6.1.1, is applied to increase the number of detected critical scenarios under the co-simulation environment. Klischat et al. [290] utilize OpenStreetMap to extract real-world road intersections and combine them with SUMO to generate realistic traffic scenarios. Wen et al. [291] focus on triggering events in a specific area near the ego vehicle, and a convolutional neural network (CNN) based selector is utilized to choose the scenario agents, which achieves more realistic results.
The following works [292, 293, 294, 295, 296] focus on reconstructing scenarios from public crash reports. Mostadi et al. [295] utilize a distance metric, i.e., the Manhattan distance, to align the virtual scenarios with real-world scenarios. Computer vision algorithms, i.e., object detection and tracking, are adopted in References [292, 296] to extract the trajectories of the vehicles from crash videos. Gambi et al. [189, 293, 294] utilize natural language processing techniques to extract the relevant information and then calculate the abstract trajectories for recreating the crash. Experimental results show that the method can accurately reconstruct the crashes in public reports, and the generated test cases are able to expose faults in an open source ADS, i.e., DeepDriving [207].
There is also a line of work [297, 298, 299, 300] that focuses on narrowing the reality gap in the training process. By including components such as augmented data, small-scale cars, and real-world tracks, they could generate more realistic cases to train the perception model or reinforcement learning algorithms for automated driving.
Realism of simulators. A comparative study on the assessment of testing at different levels of simulation is performed by Antkiewicz et al. [301]. In their work, the authors study simulation-based testing, mixed-reality testing, and real-world testing on two scenarios, i.e., car following and surrogate actor pedestrian crossing. They propose various metrics, e.g., realism, cost, agility, scalability, and controllability, and based on these metrics, they compare the different testing schemes under evaluation. As their conclusion, they quantitatively show the performance differences among the testing schemes: although real-world testing is better in terms of realism, it is more costly, and less agile, scalable, and controllable, than simulation-based testing; the performance of mixed-reality testing lies in between. Testing ADS in different simulators is studied by Borg et al. [302], who utilize search-based testing techniques to generate scenarios in two simulators, i.e., PreScan [303] and Pro-SiVIC [304]. They find notable differences in the test outputs, e.g., the two simulators expose different safety violations. Consequently, they recommend involving multiple simulators for more robust simulation-based testing in the future.
Although simulation-based testing cannot achieve the same realism as real-world testing, to what extent can its results benefit real-world testing? This question is investigated in Reference [305], where the authors perform simulation-based testing to identify critical scenarios and map them to a real-world environment. Their key insights show that \(62.5\%\) of the unsafe scenarios detected in the simulators translate to real collisions and \(93.3\%\) of the scenarios deemed safe in the simulators are also safe in the real world. Another question is whether simulator-generated datasets can substitute real-world datasets for DNN-based ADS testing, which has been studied in References [306, 307]. Moreover, these works also compare offline testing, e.g., module-level testing, and online testing, e.g., system-level testing, in terms of their pros and cons. Experiments on DNN-based ADS show that the average prediction error difference between the two datasets is less than 0.1, which means that the simulator-generated dataset can serve as an alternative to the real-world dataset. Online testing is more suitable than offline testing for DNN-based ADS testing, since online testing can detect more errors, i.e., those caused by accumulation over time, than offline testing. Reway et al. [308] evaluate the simulation-to-reality gap by testing an object detection algorithm under three different environments, namely a real proving ground and two simulation software packages, considering four weather conditions. The gap is quantitatively calculated by considering metrics such as precision and recall on each platform. One of their experimental results is that the gap between the real and simulation domains under nighttime and rainy conditions is larger than that under daytime conditions.
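To illustrate how such a quantitative comparison can be set up, the sketch below computes precision and recall of the same detector on two platforms and reports the per-metric difference as a simple view of the simulation-to-reality gap. The counts are invented for illustration and are not taken from Reference [308].

```python
def detection_metrics(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented detection counts for the same object detector on two platforms.
real_precision, real_recall = detection_metrics(tp=420, fp=35, fn=80)
sim_precision, sim_recall = detection_metrics(tp=465, fp=20, fn=35)

gap = {
    "precision": abs(real_precision - sim_precision),
    "recall": abs(real_recall - sim_recall),
}
print(gap)  # per-metric view of the simulation-to-reality gap
```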

6.4 Answer to RQ2

Overall, we have surveyed more than 90 papers dedicated to the system-level testing of ADS. We find that module-level testing techniques, such as search-based testing, sampling, and adversarial attack, are also widely adopted for finding failures arising from the collaboration between different modules at the system level. Besides, more metrics, which can be found in Sections 6.1.2 and 6.1.3, are proposed or utilized to facilitate the testing process. Another observation is that more than 30 papers focus on bridging the gap between the simulation and the real-world environments, e.g., by introducing real components into the testing loop or by comparing simulation-based testing with real-world testing.
Similarly to module-level testing, there still remain several open challenges for the system-level testing of ADS. For example, since system executions during testing are expensive and time-consuming, how to accelerate the testing process needs future exploration. More discussions about the challenges and future research directions can be found in Section 8.

7 Statistics and Analysis of Literature

In this section, based on the survey results in Sections 4, 5, and 6, we perform a statistical analysis. Specifically, we provide the threat model for general ADS in Section 7.1, and we collect popular datasets, tool stacks, and programming languages for ADS testing in Section 7.2.

7.1 The Threat Model for General ADS

In this section, we construct a threat model to summarize the safety and security threats that each module may confront based on our survey results. To build the threat model, we first summarize the threats discovered in the papers that we survey in the module-level testing in Section 5; then, as a complement, we review the bugs shown in the empirical studies [34, 35] on open source ADS to understand the concrete issues encountered in each module during system development. Our threat model is shown in Figure 5.
Threats to sensing. In this module, existing studies mainly concern the hardware aspect, i.e., the physical sensors that are critical hardware used in an ADS for collecting information about the external environment. Common threats such as harsh weather conditions can reduce the capabilities of the intelligent sensors. There are also many deliberate attack techniques, such as jamming attacks [52, 53, 54] and spoofing attacks [55, 56, 57, 58] (see details in Section 5.1.2), that target this module and can interfere with these sensors and harm their normal functionality.
Fig. 5. Threat model of ADS.
Threats to perception. The perception module is the most investigated, and we collect 23 testing techniques dedicated to this module. A common threat comes from adversarial examples that are generated by adding perturbations to normal images, which can fool the deep learning models in the perception module into making incorrect predictions, as shown in References [61, 62, 63, 64, 65, 66, 67, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 83]. Another type of threat is the Trojan attack [88, 89], in which malicious data are injected into the training data of the deep learning models. Moreover, in the case where the ADS requires an HD map from a cloud service, denial of service [309] or fake HD map data [310] can interfere with perception tasks such as localization.
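To give a sense of how a perturbation-based attack operates in principle, the following sketch implements the classic one-step fast gradient sign method (FGSM) in PyTorch. It is a generic illustration with a placeholder classifier and random input; the attacks surveyed above are considerably more sophisticated, e.g., physically realizable or sensor-aware.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """One-step FGSM: nudge the image in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Toy usage with a placeholder classifier (a real traffic-sign model would go here).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 43))
image = torch.rand(1, 3, 32, 32)   # normalized RGB image in [0, 1]
label = torch.tensor([7])          # ground-truth class index
adv = fgsm_perturb(model, image, label)
print((adv - image).abs().max())   # perturbation magnitude bounded by epsilon
```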
Threats to planning. With the data produced by the perception module, the planning module takes charge of several tasks, e.g., object trajectory prediction and path planning, and we collect eight testing techniques for this module. A common threat comes from the unwanted maneuvers of NPCs [137, 138, 139, 141, 142, 143, 144, 145], which can interfere with the prediction for moving objects and thus lead to an unsafe trajectory plan. Moreover, improper localization from the perception module can also threaten the accuracy of output trajectories.
Threats to control. This module mostly adopts mature control techniques, e.g., MPC and PID, and is thus relatively hard to attack. One threat comes from injected faults [151, 152], which can affect the longitudinal/lateral control of the vehicle. Another threat emerges in emergency situations, e.g., when emergency braking is needed, which the control module may fail to handle. Moreover, the output signals are sent via the CAN bus to the ECU for controlling the vehicle. Since this process involves data transmission between software and hardware, a potential threat is an interface mismatch [311], e.g., an inappropriate steering angle rate, in practical usage.
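As an illustration of how an injected fault can corrupt an otherwise healthy control command, the sketch below shows a minimal longitudinal PID speed controller with a hypothetical fault-injection hook on the command path. The gains and the fault model are our own assumptions and are not taken from the cited works.

```python
class PID:
    """Minimal longitudinal PID speed controller with illustrative gains."""
    def __init__(self, kp=0.3, ki=0.05, kd=0.01, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, target_speed, current_speed, fault=0.0):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        throttle = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Hypothetical fault-injection point: a corrupted interface or an attacker
        # could bias the command before it is sent over the CAN bus to the ECU.
        return max(0.0, min(1.0, throttle + fault))

clean, faulty = PID(), PID()
print(clean.step(12.0, 10.0))               # nominal throttle command
print(faulty.step(12.0, 10.0, fault=-0.5))  # same state, but the command is corrupted
```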

7.2 Datasets and Tool Stacks for ADS Testing

Datasets. In the context of ADS, deep learning components handle safety-critical tasks, e.g., perception and end-to-end control, so it is necessary to validate their robustness under various scenarios. This process typically relies on data from the real world, which is, however, hard to obtain in general. Fortunately, there is a collection of publicly available datasets that alleviate the problem, involving large quantities of real-world pictures and videos recorded by onboard sensors. For example, the KITTI dataset [313] contains over 10,000 images of traffic scenarios, collected by a variety of sensors including high-resolution RGB/grayscale stereo cameras and a 3D laser scanner.
In this section, we summarize the scenario-driven datasets for ADS testing in Table 10. The first column shows the time when each dataset was released. The next three columns give the name, brief description, and the size of each dataset, and the last column indicates the related works that adopt these datasets. Note that the datasets for other machine learning testing tasks [337] that have nothing to do with ADS testing are not listed here; in other words, all the datasets listed here are dedicated to ADS testing.
Time | Dataset | Description | Size | Reference
2004 | NGSIM [312] | Vehicle trajectory data at four different locations | | [143]
2012 | KITTI [313] | Driving scenes captured by a standard station wagon | 12,919 images | [58, 72, 73, 84, 99, 100, 167, 172, 247, 292]
2013 | GTSRB [125] | Containing 43 classes of traffic signs in Germany | 50,000+ images | [66, 80, 83, 88]
2014 | BelgiumTS [314] | A large dataset with traffic sign annotations | 10,000+ images | [88]
2015 | LISA [315] | A traffic sign dataset containing US traffic signs | 43,000+ images | [80]
2016 | Cityscapes [316] | A diverse set of stereo video sequences recorded in street scenes | 25,000 images | [65, 92]
2016 | Udacity [317] | Video frames taken from urban roads | 410,530 images | [107, 166, 167, 172, 173, 174, 177, 306]
2016 | SYNTHIA [318] | Multiple categories of virtual city rendering pictures | 220,000+ images |
2016 | Stanford Drone [319] | The movement and dynamics of pedestrians across the university campus | 69-GB videos and images |
2017 | RobotCar [320] | Various combinations of weather, traffic, and pedestrians, as well as long-term changes such as road engineering | |
2017 | CityPersons [321] | A dataset with a large proportion of occluded pedestrian images | 5,050 images |
2017 | Mapillary Vistas [322] | Street views of multiple cities under multiple seasons and weather conditions | 25,000 images |
2018 | GTA5 [323] | Synthetic images of urban traffic scenes collected using a game engine | 24,966 images | [99]
2018 | BDD100K [324] | Various scene types and weather conditions at different times of the day | 100,000 videos |
2018 | comma2k19 [325] | Over 33 hours of commute on California's Highway 280 | 33h videos | [235]
2018 | highD [326] | Traffic conditions at six different locations obtained by drone | 147h videos |
2018 | ApolloScape [327] | Images under different conditions and traffic densities | 146,997 images | [94]
2019 | ACFR [328] | Vehicle traces at 5 roundabouts | 23,000 images |
2019 | nuScenes [329] | Images under different times of day and weather conditions | 1,400,000 images |
2019 | INTERACTION [330] | A dataset collected under interactive driving scenes with semantic maps | |
2019 | Waymo [103] | Including a perception dataset with high-resolution sensor data and labels, and a motion dataset with object trajectories and corresponding 3D maps | 493,354 images |
2020 | inD [331] | Naturalistic trajectories of vehicles and vulnerable road users recorded at German intersections | 10h videos |
2020 | Ford [332] | Multiple seasons, traffic conditions, and driving environments | |
2020 | rounD [333] | Naturalistic trajectories of vehicles and vulnerable road users recorded at German roundabouts | |
2020 | openDD [334] | A trajectory dataset covering seven roundabouts | 62h videos |
2021 | Bosch Small Traffic Light [335] | An accurate dataset for vision-based traffic light detection | 13,427 images | [58, 95]
2022 | CrashD [336] | A synthetic LiDAR dataset to quantify the generality of 3D object detectors on out-of-domain samples | | [300]
Table 10. Scenario Driven Datasets for ADS Testing
As shown in Table 10, we collect 27 datasets released from 2004 to 2022, including popular ones like the KITTI dataset [313] and emerging ones like the CrashD [336] dataset. One observation is that these datasets span over various physical conditions, e.g., different times of the day [317], different weather conditions [329, 332], and different traffic density [327]. They also span over various application scenarios, such as urban street [316, 320, 322], highway [325, 326, 338], and intersection [331]. In addition, we find that some of these datasets are specific to a certain task, e.g., pedestrian detection [319, 321] and traffic sign detection [125, 314, 315].
As the column of references in Table 10 shows, several datasets such as the KITTI dataset [313] and the Udacity dataset [317] are frequently used in ADS testing due to the diverse tasks they support, such as object detection and semantic segmentation. However, we also find that a number of datasets have not been widely used, due to their own limitations. For example, the rounD [333] and openDD [334] datasets can only be used to validate the behavior planning of ADS in roundabout scenarios; SYNTHIA [318] and GTA5 [339] contain synthetic images from virtual environments, which may not be realistic enough for ADS testing.
Tool stacks. As mentioned, simulation-based testing has become an important alternative to real-world testing. Simulators usually provide vehicle dynamics, e.g., longitudinal and lateral motion of the vehicle, and virtual traffic scenarios. Moreover, simulators can help generate extreme scenarios for testing, e.g., harsh weather, which are rarely encountered in the real world. There have been many advanced simulation platforms developed for ADS testing in recent years. For example, Carla [345] is an open source simulator for ADS training and testing, which supports various sensor models and environmental conditions.
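As a minimal example of how such a simulator exposes scenario construction to testing code, the sketch below uses the Carla Python API to enforce an extreme weather condition and spawn an ego vehicle. It assumes a Carla server is already running on the default local port, and blueprint and weather preset names may differ across Carla versions.

```python
import carla

# Connect to a locally running Carla server (default port 2000).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Force an extreme weather condition that is rare in real-world test drives.
world.set_weather(carla.WeatherParameters.HardRainNoon)

# Spawn an ego vehicle at the first predefined spawn point of the map.
blueprint = world.get_blueprint_library().filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
ego = world.spawn_actor(blueprint, spawn_point)

ego.set_autopilot(True)  # let the built-in traffic manager drive the vehicle
```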
In this section, we summarize the simulation platforms, namely the simulators usually used for ADS testing, in Table 11. As shown in the table, we collect 20 simulation platforms, including classical platforms such as Matlab/Simulink [340] and CarSim [341], and emerging popular simulators, such as Carla and LGSVL [351]. Since these simulators have their own pros and cons, we compare them in the table and focus on several aspects of interest, e.g., their gap from the real environment. Specifically, the first column lists the name, and the second column shows the accessibility of each simulator. The third column concerns physical aspects, that is, whether the simulator allows for customizing a dynamics model and whether it is a soft-body or rigid-body based simulator. The fourth column indicates the level of support for mixed-reality testing, including model-in-the-loop (MiL), software-in-the-loop (SiL), HiL, and ViL. The fifth column presents the capability of these simulators to complement each other, i.e., whether they support co-simulation with other simulators. The last column indicates the related works that adopt these simulators in their research.
Simulator | Open source | Vehicle dynamics (customization; soft/rigid) | X-in-the-loop (MiL, SiL, HiL, ViL) | Interface to other simulators | Reference
Matlab/Simulink [340] \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) CarSim, CarMaker, PreScan,
Gazebo, Carla, rFpro,
VTD, Cognata, ADAMS
Pro-SiVIC
[162, 163, 248]
[188, 204, 277]
[153, 218, 257]
[195, 198]
CarSim [341] \(\times\) \(\checkmark\) rigid \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink, rFpro,
NVIDIA Drive Sim, VTD,
Pro-SiVIC, Donkey Car
[195, 257]
VISSIM [281] \(\times\) \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Carla, VTD, PreScan,
CarMaker, rFpro, SUMO
[288, 289]
SUMO [280] \(\checkmark\) \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Carla, VISSIM, Cognata,
rFpro
[194, 242, 277]
[283]
Webots [342] \(\checkmark\) rigid \(\checkmark\) [247, 292]
VTD [343] \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) CarSim, Matlab/Simulink,
ADAMS, VISSIM, rFpro
[260, 308]
Gazebo [344] \(\checkmark\) rigid \(\checkmark\) \(\checkmark\) Matlab/Simulink, ADAMS[254, 271, 275]
PreScan [303] \(\times\) rigid \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink, VISSIM,
Pro-SiVIC
[166, 188, 306]
[195, 302]
BeamNG [25] \(\checkmark\) \(\checkmark\) soft \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) [189, 193, 293]
[190, 200, 293]
[191, 192]
Carla [345] \(\checkmark\) \(\times\) rigid \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) CarSim, VISSIM, SUMO,
Matlab/Simulink
[101, 171, 346]
AirSim [347] \(\checkmark\) \(\times\) rigid \(\checkmark\) \(\checkmark\)
rFpro [348] \(\times\) \(\checkmark\) rigid \(\checkmark\) \(\checkmark\) CarSim, Matlab/Simulink,
CarMaker, VISSIM, VTD,
SUMO
Cognata [349] \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink, SUMO
NVIDIA Drive Sim [350] \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) CarMaker, CarSim
LGSVL [351] \(\checkmark\) \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) [72, 140, 196]
[235, 252, 305]
SCANeR Studio [352] \(\times\) \(\checkmark\) soft/rigid \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) [164]
ADAMS [353] \(\times\) \(\checkmark\) rigid \(\checkmark\) \(\checkmark\) Gazebo, Matlab/Simulink,
VTD
CarMaker [222] \(\times\) \(\checkmark\) rigid \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink, VISSIM,
rFpro, NVIDIA Drive Sim
[214, 277, 288]
[218, 221, 289]
[198, 221, 308]
Pro-SiVIC [354] \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink, CarSim,
PreScan
[302]
Donkey Car [282] \(\checkmark\) \(\times\) \(\checkmark\) \(\checkmark\) \(\checkmark\) Matlab/Simulink[279]
Table 11. Simulation Platforms for ADS Testing
Based on the table, we can draw the following conclusions:
First, many commercial simulators are not open source, e.g., CarMaker [222] and PreScan [303]. These simulators can be expensive, which makes it difficult for researchers to satisfy their research goals. In comparison, open source simulators like Carla could have broader prospects for future research;
Second, accurate physical dynamics models are needed to bridge the gap between simulation-based testing and real-world testing and to satisfy different testing requirements, e.g., a smooth road needs a lower friction coefficient. We find that there have been simulators, e.g., BeamNG [15] and CarSim, dedicated to this aspect and allowing for dynamics model customization. In particular, BeamNG is also a soft-body based simulator that supports more realistic collision effects (see more details in Section 2.3);
Third, we find that most simulators support software-in-the-loop and hardware-in-the-loop testing. Several simulators, e.g., CarMaker and VTD [343], support vehicle-in-the-loop testing, which closes the gap between hardware-in-the-loop testing and real-world testing.
Last, we find that a number of simulators have built-in interfaces to other simulators. This is essential for performing co-simulation in ADS testing, since these simulators have their own pros and cons, and co-simulation allows them to complement each other and provide a more realistic testing environment. For example, CarMaker (accurate vehicle dynamics) and VISSIM (representative traffic flow) are combined into a co-simulation framework [288] for generating more realistic testing scenarios.
Overall, it can be seen from the reference column that Matlab/Simulink and BeamNG have been widely used for ADS testing. Simulators like NVIDIA Drive Sim and Gazebo also have great potential for future research, since they cover multiple features we list in the table, e.g., whether they could perform ViL testing or co-simulation with other simulators.
Moreover, we also introduce several publicly available systems under test in Table 12. OpenPilot, Apollo, Autoware, and Pylot are all modular systems and have already been described in Section 2.3, so we will not repeat them here. In addition to modular ADS, there also exist open source end-to-end systems: LBC [356] is an imitation learning controller that uses camera images and direction commands as input to control the direction of the vehicle in the lane and at intersections; DeepDriving [357] and Nvidia CNN Lane Follower [358] are also widely used CNN-based end-to-end controllers; and open source DNN models from the Udacity self-driving challenge, such as Chauffeur [179] and Epoch [180], are another line of end-to-end driving controllers. Besides, there are also driving agents and controllers from simulators, such as BeamNG and Carla. BeamNG.AI, which has been mentioned in Section 2.3, is an AI agent in the BeamNG simulator that accepts virtual images in the simulator as input for path planning and trajectory tracking. Carla PID is a specific module in Carla that performs calculations at the motion planning stage and estimates the acceleration, braking, and steering inputs required to reach target positions.
System category | Name | Description
Modular | Apollo | A commercial-grade ADS developed by Baidu
Modular | Autoware [355] | An L4 ADS developed by Nagoya University
Modular | OpenPilot | A commercial-grade L2 ADAS developed by Comma.ai
Modular | Pylot [23] | A modular ADS with low-latency dataflow from academia
End-to-end | LBC [356] | An end-to-end controller based on imitation learning
End-to-end | DeepDriving [357] | A CNN-based end-to-end system that provides ACC and ALC
End-to-end | Nvidia CNN Lane Follower [358] | An end-to-end lane following system based on CNN
End-to-end | Udacity DNN Models [359] | DNN-based steering prediction models from the Udacity challenge, e.g., Chauffeur [179] and Epoch [180]
Other | BeamNG.AI [15] | An AI agent in BeamNG, which can realize simple control of vehicles
Other | Carla PID [345] | A specific controller built in Carla
Table 12. Open Source ADS
Programming languages. To systematically generate test cases, it has become a trend to propose new programming languages for testing scenario description. In this way, the generation of a new test case boils down to writing a program that describes the scenario. Also, researchers can make use of existing coverage criteria for programs, such as the code coverage criteria, to assess the adequacy of the generated tests.
To define such a programming language, researchers need to formally express the basic elements of an ADS scenario, e.g., the ego car, other cars, pedestrians, and static objects. Since these languages are usually built on existing formats, they vary in their ways of expressing those elements. For instance, Scenic [361], a Python-like language, requires users to define those objects as variables; in contrast, GeoScenario [360] provides users with a graphical interface where they can drag icons to describe a scenario. Moreover, these languages usually do not emerge independently; instead, they come with specific simulators or even specific ADS.
In this section, we summarize the state-of-the-art programming languages for test case generation in Table 13, and introduce their dependent formalisms, their bonded simulators, other features, and their adoption in ADS testing. There exists literature, e.g., Reference [367], that surveys programming languages for the test generation of ADS. Compared to Reference [367], our main aim is to show the ecosystems and the landscape of the use of these languages in ADS testing, as a reference for the readers to better understand the testing techniques in Sections 5 and 6. Also, our study includes some of the latest achievements in this direction, e.g., paracosm [366] and SceGene [365].
Language | Dependencies | Supported simulators | Other features | Reference
OpenScenario | Unified Modeling Language (UML), XML | Carla, Matlab, PreScan | A scenario is described in a “Storyboard” tag in XML, which includes a series of events | [101]
GeoScenario [360] | XML | An Unreal-based driving simulator | The language is based on the OpenStreetMap standard. Users can either program by dragging icons or code in an XML editor. |
Scenic [361] | Imperative, object-oriented (Python-like) | Carla, LGSVL, Webots | A probabilistic programming language that can specify the input distributions of machine learning components and use that information for testing and analysis. | [305]
stiEF [362] | Domain-specific language | VTD | It supports multilingual representations for scenario description. |
SceML [363] | Graph-based modeling framework | Carla | It allows information modeling at different depths, to support scenarios at different abstraction levels. |
CommonRoad [364] | XML | SUMO | It provides a benchmark set that contains scenarios for the study of motion planning. | [143]
SceGene [365] | A hierarchical representation model | | It supports scenario generation via bio-inspired operations, such as crossover and mutation. |
paracosm [366] | Reactive programming model | Udacity's self-driving simulator | It adopts reactive objects that allow describing temporal reactive behavior of entities. It also defines coverage criteria for test case generation. |
Table 13. Programming Languages for Scenario Generation
As shown by Table 13, we collect eight representative programming languages, including classic ones, such as OpenScenario, that have been widely used in different stages of the development of ADS, and emerging ones, such as paracosm [366]. Our findings are as follows. First, these languages are designed for different purposes and offer distinct features: e.g., Scenic [361] allows probabilistic sampling for testing driving systems with machine learning components; SceGene [365] designs bio-inspired operations, such as crossover and mutation, for scenario generation. Second, some of these languages provide more user-friendly features; for instance, some of them, e.g., SceML [363], provide a graphical user interface for users to define their scenarios. However, as the reference column shows, most of these languages have not been widely adopted in practice. This can be due to several reasons: one possibility is that some languages are still too specialized for practitioners to adopt in their work; also, since many of the languages, such as GeoScenario [360], are designed for specific systems, they are still ad hoc and not easily extensible to other systems.
In conclusion, programming languages are increasingly deemed powerful tools for test case generation in ADS testing, but they have not yet been widely adopted in practice.

8 Challenges and Opportunities

As this survey reveals, ADS testing has experienced rapid growth in recent years. Nevertheless, there are still many challenges and open questions in its development and deployment. Based on our analysis of the collected literature and our discussions in each section, we answer RQ3 by listing the challenges and opportunities in this direction, as shown in Figure 6. For the first four challenges, there already exist solutions that could be further improved, while the last three challenges still need further exploration and require a long period of research.
Efficient test generation methods. Efficiency is one of the most important objectives in ADS testing, since system executions, whether in simulated environments or in the real world, are expensive. There have been many methods that aim to reduce the number of system executions, e.g., training surrogate models [202, 203, 204, 205] or adopting sampling-based methods [217, 218, 219, 221, 223, 224, 228, 229, 230], as discussed in Section 6.1.1. However, these methods have several limitations; for example, the process of preparing training data for surrogate models in Reference [202] is time-consuming. One potential future direction is to explore the application of traditional cost reduction techniques, such as test selection and test prioritization, to further accelerate the testing process.
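The following sketch illustrates the surrogate-based prioritization idea in its simplest form: a regression model trained on already executed scenarios predicts a criticality score (here, minimum TTC) for unexecuted candidates so that the most critical ones are simulated first. The scenario encoding and the data are invented for illustration and do not reproduce any surveyed approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical scenario encoding: [ego speed, NPC speed, initial gap, rain intensity].
executed = np.array([[12.0, 10.0, 25.0, 0.0],
                     [20.0,  8.0, 12.0, 0.8],
                     [15.0, 14.0, 30.0, 0.2],
                     [22.0,  6.0, 10.0, 0.9]])
# Observed criticality of the executed scenarios, e.g., minimum TTC during the run.
observed_min_ttc = np.array([4.1, 0.9, 3.5, 0.4])

# Surrogate model: predict criticality without running the expensive simulation.
surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
surrogate.fit(executed, observed_min_ttc)

# Prioritize candidate scenarios: run the ones predicted to be most critical first.
candidates = np.array([[18.0, 9.0, 15.0, 0.5],
                       [13.0, 12.0, 28.0, 0.1],
                       [21.0, 7.0, 11.0, 0.7]])
order = np.argsort(surrogate.predict(candidates))  # smallest predicted TTC first
print(candidates[order])
```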
Fig. 6. Illustration of challenges and opportunities.
Realism of test cases. Generating realistic test cases that can really threaten the safety of ADS in the real world should be another important goal of test case generation. Unrealistic test cases that cannot happen in the real world are meaningless and not worth addressing. However, compared to efficiency, this aspect is usually ignored. Generating realistic test cases is a demand across different modules, and some existing works have paid attention to this problem. For example, in perception module testing, RP \(_2\) [80] is proposed to generate test cases under real physical conditions; in planning module testing, avoidable collision [148] is proposed to filter out useless test cases; moreover, this is also a major issue in system-level testing, as discussed in Section 6.3. In addition to these efforts, the problem deserves more attention so that truly useful test cases can be found.
Oracle problem for different modules. Although there have been many works that try to design suitable oracles for different modules of ADS, there still remain many open challenges in defining oracles that account for the different characteristics of different modules. For the perception module, as discussed in Section 5.2.4, the automatic labeling method in Reference [92] targets only semantic segmentation models, so one future direction is to explore how to automatically generate high-fidelity ground-truth labels for other types of models in the perception module. For the planning module, as discussed in Section 5.3.4, criteria such as avoidable collision [148] are ad hoc and may not generalize to other systems. Metamorphic relations are widely adopted by works [94, 95, 96, 97, 98, 99, 165, 173] for different modules, but they may lack sufficient accuracy and thus lead to false positives. Hence, one potential future direction is to design more accurate and reliable oracles for the testing of different modules in ADS.
Effective coverage criteria. Coverage criteria are used as guidance to generate diverse test cases for testing. As discussed in Sections 5 and 6, various coverage criteria have been proposed for testing different modules of the ADS, e.g., neuron coverage [107] for the perception and end-to-end modules, and weight coverage [137] and route coverage [140] for the planning module. Notably, few coverage criteria have been proposed for the control module, which indicates a future research direction. Moreover, one problem in the existing studies is that they mainly consider covering the spatial aspects of the test cases; for instance, neuron coverage [107] is computed based on the activated neurons in a DNN model and used as guidance to trigger diverse behaviors of single DNNs. However, in the testing of ADS that run over a time period, even if a strange behavior is triggered for a moment, it may not affect the system behavior over the period if it is immediately corrected. Therefore, in the testing of ADS, we need to trigger diverse behaviors of the DNNs over time. For instance, if a DNN keeps making wrong predictions for a period, then it is likely to lead to a collision. Besides, several studies [134, 135, 136] have demonstrated that neuron coverage may not be suitable for guiding the testing process. Whether these findings will affect ADS testing, or whether there exist more effective criteria dedicated to perception testing, needs further exploration. To sum up, coverage criteria dedicated to the control module are expected to be proposed in the future, and another research direction is to incorporate the temporal aspects into the existing coverage criteria.
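For concreteness, the sketch below shows a simplified, DeepXplore-style computation of neuron coverage over recorded layer activations; it is our own re-implementation for illustration, not the tool of Reference [107].

```python
import numpy as np

def neuron_coverage(activations, threshold=0.5):
    """activations: list of arrays, one per layer, shaped (num_tests, num_neurons).
    Each layer's output is min-max scaled per test input; a neuron counts as
    covered if its scaled activation exceeds the threshold for at least one input."""
    covered, total = 0, 0
    for layer in activations:
        lo = layer.min(axis=1, keepdims=True)   # per-input minimum over the layer
        hi = layer.max(axis=1, keepdims=True)
        scaled = (layer - lo) / np.maximum(hi - lo, 1e-8)
        covered += int(np.sum(scaled.max(axis=0) > threshold))
        total += layer.shape[1]
    return covered / total

# Toy example: activations of two layers recorded over five test inputs.
rng = np.random.default_rng(0)
layers = [rng.random((5, 16)), rng.random((5, 8))]
print(f"Neuron coverage: {neuron_coverage(layers):.2%}")
```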
Online monitoring. In this work, we mainly survey testing techniques for ADS based on posterior checking of system executions; another effective quality assurance scheme is online monitoring [368, 369], which monitors the system behavior at runtime. As an advantage, online monitoring can detect unsafe behavior during the system execution and thus warn drivers to take actions to avoid the safety risk. As discussed in Section 5.2.4, there have been some works, e.g., Reference [101], that rely on formal temporal specifications to monitor the perception module at runtime. Besides, the model-based oracle proposed by Stocco et al. [175] is also a system-level online monitoring approach, as it predicts the misbehaviors of the system at runtime. However, automatically monitoring the other modules of the system remains a great challenge, and developing monitoring techniques for them is a potential future direction. Meanwhile, more expressive specification languages should be provided to handle real-world system requirements.
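The sketch below illustrates the basic shape of such an online monitor for a simple invariant ("the ego vehicle always keeps at least d_min meters from the closest obstacle"). The requirement and interface are our own simplification; the cited works rely on formal specification languages rather than hand-written checks.

```python
class DistanceMonitor:
    """Online monitor for the requirement 'always (distance >= d_min)'.
    It consumes one observation per control cycle and raises a warning
    as soon as the requirement is violated."""
    def __init__(self, d_min=2.0):
        self.d_min = d_min
        self.violated_at = None

    def observe(self, t, distance_to_closest_obstacle):
        if self.violated_at is None and distance_to_closest_obstacle < self.d_min:
            self.violated_at = t
            print(f"[monitor] safety violation at t={t:.2f}s: "
                  f"distance {distance_to_closest_obstacle:.2f} m < {self.d_min} m")
        return self.violated_at is None   # True while the requirement still holds

monitor = DistanceMonitor(d_min=2.0)
for t, d in [(0.0, 5.1), (0.1, 3.4), (0.2, 1.8)]:   # toy trace from the simulator
    monitor.observe(t, d)
```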
Fault analysis of system failure. As this survey shows, the function of an ADS relies on the collaborative work of different modules; indeed, the malfunction of any module can cause a failure at the system level. Therefore, one question arises regarding which module should be deemed the main cause of a system failure. Currently, as discussed in Section 6.1.4, the research attention is mostly focused on failure detection rather than fault analysis. Moreover, fault analysis of ADS is challenging in nature, because it requires us to define the boundaries of each module properly and make the oracles of each module clear. Sometimes, the failure of the system is due not to a single module but to the interactions between different modules. Therefore, one future direction is to propose effective fault analysis techniques as well as their validation methods.
Simulators vs. real world. Because of the high cost of real-world testing, simulation-based testing is the most commonly used testing paradigm; however, even with modern high-fidelity simulators (e.g., Carla and LGSVL), there is still a gap from real-world testing. Recently, mixed-reality testing schemes, including HiL, ViL, and SciL (see Section 6.2 for more details), which mix simulation-based testing with real-world testing, have emerged as a tradeoff between the two. While HiL and ViL testing have developed quickly over the years, SciL testing, which is closest to the real world, is still at a theoretical stage and has not yet been widely adopted. As discussed in Section 7.2, existing simulators all have their pros and cons, and one future direction is to combine their distinguishing features, e.g., via co-simulation, to enhance the realism of the simulation environment. Moreover, several works in Section 6.3 try to estimate how far simulation-based testing is from real-world testing. Nevertheless, when handling complex traffic scenarios in testing, there are still open questions, such as how to choose between simulation-based testing and real-world testing and how to mitigate the weaknesses of the selected testing paradigm. To summarize, the gap between simulation-based testing and real-world testing still exists, and one research direction is to explore how to utilize the results of simulation-based testing to reduce the cost of real-world testing.
Answer to RQ3. Based on our survey results, we identify seven major challenges for ADS testing and discuss the corresponding potential research opportunities. Moreover, as shown in Figure 6, we find that several challenges such as the efficiency of test generation could be improved in the short run; by contrast, some other challenges (for example, how to mitigate the gap between simulation and real-world environments) may require a long period of research.

9 Conclusion

This survey provides a comprehensive overview and analysis of the relevant studies on ADS testing. These testing works cover both module-level testing and system-level testing of ADS, and we also include the works on empirical study w.r.t. system testing, mixed-reality testing, and real-world testing. In the introduction to the testing of each module, we respectively unfold the landscape of the literature from three perspectives, namely test methodology, test oracle, and test adequacy. Based on the literature review, we also perform analysis on the landscape of ADS testing and propose a number of challenges and research opportunities in this direction.
Our work gives a specific emphasis on the technical differences in the testing of different modules and also reveals the gap between simulation-based testing and real-world testing. Moreover, our analysis and discussion on the challenges and opportunities based on the literature review point out the future direction of research in this field. We hope that this work can inspire and motivate more contributions to the safety assurance of ADS, and we also hope that ADS can be sufficiently reliable to be adopted by more people as early as possible.

Footnotes

6
Search-based testing and fuzzing are similar concepts coming from different communities. The former emphasizes the testing methodology via search, which relies on well-defined fitness functions and applies search heuristics, e.g., evolutionary algorithms, to find the target test cases. Fuzzing comes from the security community, and its methodological essence lies in its randomness. Similarly to search-based testing, fuzzing often also comes with an objective function as guidance that helps it achieve the target more efficiently.

References

[1]
Mordor Intelligence Inc. 2022. Autonomous (Driverless) Car Market—Growth, Trends, COVID-19 Impact, and Forecast (2022–2027). Retrieved from https://www.mordorintelligence.com/industry-reports/autonomous-driverless-cars-market-potential-estimation.
[2]
Jin Cui, Lin Shen Liew, Giedre Sabaliauskaite, and Fengjun Zhou. 2019. A review on safety failures, security attacks, and available countermeasures for autonomous vehicles. Ad Hoc Netw. 90 (2019), 101823.
[3]
Bay City News. 2022. Fatal Crash on SB I-680 Onramp in San Jose. Retrieved from https://www.kron4.com/news/bay-area/fatal-crash-on-sb-i-680-onramp-in-san-jose/.
[4]
Scott Drew Pendleton, Hans Andersen, Xinxin Du, Xiaotong Shen, Malika Meghjani, You Hong Eng, Daniela Rus, and Marcelo H. Ang. 2017. Perception, planning, control, and coordination for autonomous vehicles. Machines 5, 1 (2017), 6.
[5]
Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep learning techniques for autonomous driving. J. Field Robot. 37, 3 (2020), 362–386.
[6]
Francisca Rosique, Pedro J. Navarro, Carlos Fernández, and Antonio Padilla. 2019. A systematic review of perception system and simulators for autonomous vehicles research. Sensors 19, 3 (2019), 648.
[7]
Xinhai Zhang, Jianbo Tao, Kaige Tan, Martin Torngren, Jose Manuel Gaspar Sanchez, Muhammad Rusyadi Ramli, Xin Tao, Magnus Gyllenhammar, Franz Wotawa, Naveen Mohan, et al. 2022. Finding critical scenarios for automated driving systems: A systematic mapping study. IEEE Trans. Softw. Eng. (2022).
[8]
Ziyuan Zhong, Yun Tang, Yuan Zhou, Vania de Oliveira Neves, Yang Liu, and Baishakhi Ray. 2021. A survey on scenario-based testing for automated driving systems in high-fidelity simulation. arXiv:2112.00964. Retrieved from https://arxiv.org/abs/2112.00964.
[9]
Gunel Jahangirova, Andrea Stocco, and Paolo Tonella. 2021. Quality metrics and oracles for autonomous vehicles testing. In Proceedings of the 14th IEEE Conference on Software Testing, Verification and Validation (ICST’21). IEEE, 194–204.
[10]
Simon J. Julier and Jeffrey K. Uhlmann. 1997. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI, Vol. 3068. International Society for Optics and Photonics, 182–193.
[11]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’15). 91–99.
[12]
Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv:1804.02767. Retrieved from https://arxiv.org/abs/1804.02767
[13]
Michael A. Johnson and Mohammad H. Moradi. 2005. PID Control. Springer.
[14]
Eduardo F. Camacho and Carlos Bordons Alba. 2013. Model Predictive Control. Springer.
[15]
BeamNG. BeamNG.tech. Retrieved from https://beamng.tech/.
[20]
[23]
Ionel Gog, Sukrit Kalra, Peter Schafhalter, Matthew A. Wright, Joseph E. Gonzalez, and Ion Stoica. 2021. Pylot: A modular platform for exploring latency-accuracy tradeoffs in autonomous vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’21). IEEE, 8806–8813.
[24]
Ionel Gog, Sukrit Kalra, Peter Schafhalter, Joseph E. Gonzalez, and Ion Stoica. 2022. D3: A dynamic deadline-driven approach for building autonomous vehicles. In Proceedings of the 17th European Conference on Computer Systems. 453–471.
[25]
BeamNG Gmbh. BeamNG.tech Technical Paper. Retrieved from https://www.beamng.gmbh/research.
[26]
Sebastiano Panichella, Alessio Gambi, Fiorella Zampetti, and Vincenzo Riccio. 2021. Sbst tool competition 2021. In Proceedings of the IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST’21). IEEE, 20–27.
[28]
Shuncheng Tang, Zhenya Zhang, Yi Zhang, Jixiang Zhou, Yan Guo, Shuang Liu, Shengjian Guo, Yan-Fu Li, Lei Ma, Yinxing Xue, and Yang Liu. 2023. A survey on automated driving system testing: Landscapes and trends. [Supplementary material]. Zenodo. DOI:
[29]
Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE’14), Martin J. Shepperd, Tracy Hall, and Ingunn Myrtveit (Eds.). ACM, 38:1–38:10. DOI:
[30]
Testing of Autonomous Vehicles with a Driver. 2019. State of California Department of Motor Vehicles. Retrieved from https://www.dmv.ca.gov/portal/dmv/detail/vr/autonomous/testing.
[31]
Zi Peng, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. 2020. A first look at the integration of machine learning models in complex autonomous driving systems: A case study on Apollo. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1240–1250.
[32]
Dinghua Wang, Shuqing Li, Guanping Xiao, Yepang Liu, and Yulei Sui. 2021. An exploratory study of autopilot software bugs in unmanned aerial vehicles. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 20–31.
[33]
Fiorella Zampetti, Ritu Kapur, Massimiliano Di Penta, and Sebastiano Panichella. 2022. An empirical characterization of software bugs in open-source Cyber–Physical Systems. J. Syst. Softw. 192 (2022), 111425.
[34]
Joshua Garcia, Yang Feng, Junjie Shen, Sumaya Almanee, Yuan Xia, and Qi Alfred Chen. 2020. A comprehensive study of autonomous vehicle bugs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 385–396.
[35]
Shuncheng Tang, Zhenya Zhang, Jia Tang, Lei Ma, and Yinxing Xue. 2021. Issue categorization and analysis of an open-source driving assistant system. In Proceedings of the IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW’21). IEEE, 148–153.
[36]
Chen Lv, Dongpu Cao, Yifan Zhao, Daniel J. Auger, Mark Sullman, Huaji Wang, Laura Millen Dutka, Lee Skrypchuk, and Alexandros Mouzakitis. 2017. Analysis of autopilot disengagements occurring during autonomous vehicle testing. IEEE/CAA J. Autom. Sin. 5, 1 (2017), 58–68.
[37]
Alexandra M. Boggs, Ramin Arvin, and Asad J. Khattak. 2020. Exploring the who, what, when, where, and why of automated vehicle disengagements. Accident Anal. Prevent. 136 (2020), 105406.
[38]
Zulqarnain H. Khattak, Michael D. Fontaine, and Brian L. Smith. 2020. Exploratory investigation of disengagements and crashes in autonomous vehicles under mixed traffic: An endogenous switching regime framework. IEEE Trans. Intell. Transport. Syst. 22, 12 (2020), 7485–7495.
[39]
Kenneth E. Train. 2009. Discrete Choice Methods with Simulation. Cambridge University Press.
[40]
Shervin Hajinia Leilabadi and Stephan Schmidt. 2019. In-depth analysis of autonomous vehicle collisions in California. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 889–893.
[41]
Francesca M. Favarò, Nazanin Nader, Sky O. Eurich, Michelle Tripp, and Naresh Varadaraju. 2017. Examining accident reports involving autonomous vehicles in California. PLoS One 12, 9 (2017), e0184952.
[42]
Song Wang and Zhixia Li. 2019. Exploring the mechanism of crashes with automated vehicles using statistical modeling approaches. PLoS One 14, 3 (2019), e0214550.
[43]
Subasish Das, Anandi Dutta, and Ioannis Tsapakis. 2020. Automated vehicle collisions in California: Applying Bayesian latent class model. IATSS Res. 44, 4 (2020), 300–308.
[44]
HM Abdul Aziz and AM Hasibul Islam. 2021. A data-driven framework to identify human-critical autonomous vehicle testing and deployment zones. In Proceedings of the 14th ACM SIGSPATIAL International Workshop on Computational Transportation Science. 1–8.
[45]
Yu Song, Madhav V. Chitturi, and David A. Noyce. 2021. Automated vehicle crash sequences: Patterns and potential uses in safety testing. Accident Anal. Prevent. 153 (2021), 106017.
[46]
E. Esenturk, Siddartha Khastgir, A. Wallace, and P. Jennings. 2021. Analyzing real-world accidents for test scenario generation for automated vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’21). IEEE, 288–295.
[47]
Department for Transport. 2019. Road Safety Data—STATS19. Retrieved from https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-927747e5ce24a11f/road-safety-data.
[48]
Xinpeng Wang, Ding Zhao, Huei Peng, and David J. LeBlanc. 2017. Analysis of unprotected intersection left-turn conflicts based on naturalistic driving data. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’17). IEEE, 218–223.
[49]
D. LeBlanc, James R. Sayer, Shan Bao, Scott Bogard, Mary Lynn Buonarosa, Adam Blankespoor, and Dillon Funkhouser. 2011. Driver Acceptance and behavioral changes with an Integrated warning system: Key findings from the IVBSS FOT. In Proceedings of the 22nd International Technical Conference on the Enhanced Safety of Vehicles (ESV’11), 1–10.
[50]
Matti Kutila, Pasi Pyykönen, Werner Ritter, Oliver Sawade, and Bernd Schäufele. 2016. Automotive LIDAR sensor development scenarios for harsh weather conditions. In Proceedings of the IEEE 19th International Conference on Intelligent Transportation Systems (ITSC’16). IEEE, 265–270.
[51]
Matti Kutila, Pasi Pyykönen, Hanno Holzhüter, Michele Colomb, and Pierre Duthon. 2018. Automotive LiDAR performance verification in fog and rain. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC’18). IEEE, 1695–1701.
[52]
Hocheol Shin, Dohyun Kim, Yujin Kwon, and Yongdae Kim. 2017. Illusion and dazzle: Adversarial optical channel exploits against lidars for automotive applications. In International Conference on Cryptographic Hardware and Embedded Systems. Springer, 445–467.
[53]
Chen Yan, Wenyuan Xu, and Jianhao Liu. 2016. Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle. Def Con 24, 8 (2016), 109.
[54]
Bing Shun Lim, Sye Loong Keoh, and Vrizlynn L. L. Thing. 2018. Autonomous vehicle ultrasonic sensor vulnerability and impact assessment. In Proceedings of the IEEE 4th World Forum on Internet of Things (WF-IoT’18). IEEE, 231–236.
[55]
Qian Meng, Li-Ta Hsu, Bing Xu, Xiapu Luo, and Ahmed El-Mowafy. 2019. A GPS spoofing generator using an open sourced vector tracking-based receiver. Sensors 19, 18 (2019), 3993.
[56]
Kexiong Curtis Zeng, Shinan Liu, Yuanchao Shu, Dong Wang, Haoyu Li, Yanzhi Dou, Gang Wang, and Yaling Yang. 2018. All your GPS are belong to us: Towards stealthy manipulation of road navigation systems. In Proceedings of the 27th USENIX Security Symposium (USENIX Security’18). 1527–1544.
[57]
Rony Komissarov and Avishai Wool. 2021. Spoofing attacks against vehicular FMCW radar. In Proceedings of the 5th Workshop on Attacks and Solutions in Hardware Security. 91–97.
[58]
Wei Wang, Yao Yao, Xin Liu, Xiang Li, Pei Hao, and Ting Zhu. 2021. I can see the light: Attacks on autonomous vehicles using invisible lights. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 1930–1944.
[59]
Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. 2018. Adversarial attacks and defences competition. In The NIPS’17 Competition: Building Intelligent Systems. Springer, 195–231.
[60]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). 2672–2680.
[61]
Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Polo Chau. 2018. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 52–68.
[62]
Yue Zhao, Hong Zhu, Ruigang Liang, Qintao Shen, Shengzhi Zhang, and Kai Chen. 2019. Seeing isn’t believing: Towards more robust adversarial attack against real world object detectors. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 1989–2004.
[63]
Yang Zhang, Hassan Foroosh, Philip David, and Boqing Gong. 2019. CAMOU: Learning a vehicle camouflage for physical adversarial attack on object detections in the wild. In Proceedings of the International Conference on Learning Representations (ICLR’19).
[64]
Jung Im Choi and Qing Tian. 2022. Adversarial attack and defense of YOLO detectors in autonomous driving scenarios. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’22). IEEE, 1011–1017.
[65]
Xing Xu, Jingran Zhang, Yujie Li, Yichuan Wang, Yang Yang, and Heng Tao Shen. 2020. Adversarial attack against urban scene segmentation for autonomous vehicles. IEEE Trans. Industr. Inf. 17, 6 (2020), 4117–4126.
[66]
Yujie Li, Xing Xu, Jinhui Xiao, Siyuan Li, and Heng Tao Shen. 2020. Adaptive square attack: Fooling autonomous cars with adversarial traffic signs. IEEE IoT J. 8, 8 (2020), 6337–6347.
[67]
K. Naveen Kumar, C. Vishnu, Reshmi Mitra, and C. Krishna Mohan. 2020. Black-box adversarial attacks in autonomous vehicle technology. In Proceedings of the IEEE Applied Imagery Pattern Recognition Workshop (AIPR’20). IEEE, 1–7.
[68]
Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018. Synthesizing robust adversarial examples. In International Conference on Machine Learning. PMLR, 284–293.
[69]
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934. Retrieved from https://arxiv.org/abs/2004.10934.
[70]
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’18). 801–818.
[71]
Yulong Cao, Chaowei Xiao, Benjamin Cyr, Yimeng Zhou, Won Park, Sara Rampazzi, Qi Alfred Chen, Kevin Fu, and Z. Morley Mao. 2019. Adversarial sensor attack on lidar-based perception in autonomous driving. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 2267–2281.
[72]
Yulong Cao, Ningfei Wang, Chaowei Xiao, Dawei Yang, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, and Bo Li. 2021. Invisible for both camera and lidar: Security of multi-sensor fusion based perception in autonomous driving under physical-world attacks. In Proceedings of the IEEE Symposium on Security and Privacy (SP’21). IEEE, 176–194.
[73]
Jiachen Sun, Yulong Cao, Qi Alfred Chen, and Z. Morley Mao. 2020. Towards robust lidar-based perception in autonomous driving: General black-box adversarial sensor attack and countermeasures. In Proceedings of the 29th USENIX Security Symposium (USENIX Security’20). 877–894.
[74]
Xupeng Wang, Mumuxin Cai, Ferdous Sohel, Nan Sang, and Zhengwei Chang. 2021. Adversarial point cloud perturbations against 3D object detection in autonomous driving systems. Neurocomputing 466 (2021), 27–36.
[75]
Chang Chen and Teng Huang. 2021. Camdar-adv: Generating adversarial patches on 3D object. Int. J. Intell. Syst. 36, 3 (2021), 1441–1453.
[76]
Yi Zhu, Chenglin Miao, Tianhang Zheng, Foad Hajiaghajani, Lu Su, and Chunming Qiao. 2021. Can we use arbitrary objects to attack lidar perception in autonomous driving? In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. 1945–1960.
[77]
Kaichen Yang, Tzungyu Tsai, Honggang Yu, Max Panoff, Tsung-Yi Ho, and Yier Jin. 2021. Robust roadside physical adversarial attack against deep learning in lidar perception modules. In Proceedings of the ACM Asia Conference on Computer and Communications Security. 349–362.
[78]
Yi Zhu, Chenglin Miao, Foad Hajiaghajani, Mengdi Huai, Lu Su, and Chunming Qiao. 2021. Adversarial attacks against LiDAR semantic segmentation in autonomous driving. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. 329–342.
[79]
Yiming Li, Congcong Wen, Felix Juefei-Xu, and Chen Feng. 2021. Fooling lidar perception via adversarial trajectory perturbation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7898–7907.
[80]
Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1625–1634.
[81]
Dawn Song, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramer, Atul Prakash, and Tadayoshi Kohno. 2018. Physical adversarial examples for object detectors. In Proceedings of the 12th USENIX Workshop on Offensive Technologies (WOOT’18).
[82]
Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. 2018. Generating adversarial examples with adversarial networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18). International Joint Conferences on Artificial Intelligence, 3905–3911.
[83]
Aishan Liu, Xianglong Liu, Jiaxin Fan, Yuqing Ma, Anlan Zhang, Huiyuan Xie, and Dacheng Tao. 2019. Perceptual-sensitive gan for generating adversarial patches. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1028–1035.
[84]
Zuobin Xiong, Honghui Xu, Wei Li, and Zhipeng Cai. 2021. Multi-source adversarial sample attack on autonomous vehicles. IEEE Trans. Vehic. Technol. 70, 3 (2021), 2822–2835.
[85]
Handi Yu and Xin Li. 2018. Intelligent corner synthesis via cycle-consistent generative adversarial networks for efficient validation of autonomous driving systems. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference (ASP-DAC’18). IEEE, 9–15.
[86]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
[87]
Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning attack on neural networks. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS’18).
[88]
Wenbo Jiang, Hongwei Li, Sen Liu, Xizhao Luo, and Rongxing Lu. 2020. Poisoning and evasion attacks against deep learning algorithms in autonomous vehicles. IEEE Trans. Vehic. Technol. 69, 4 (2020), 4439–4449.
[89]
Shaohua Ding, Yulong Tian, Fengyuan Xu, Qun Li, and Sheng Zhong. 2019. Trojan attack on deep generative models in autonomous driving. In International Conference on Security and Privacy in Communication Systems. Springer, 299–318.
[90]
Russell Eberhart and James Kennedy. 1995. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science (MHS’95). IEEE, 39–43.
[91]
Rui Qian, Robby T. Tan, Wenhan Yang, Jiajun Su, and Jiaying Liu. 2018. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2482–2491.
[92]
Wei Zhou, Julie Stephany Berrio, Stewart Worrall, and Eduardo Nebot. 2019. Automated evaluation of semantic segmentation robustness for autonomous driving. IEEE Trans. Intell. Transport. Syst. 21, 5 (2019), 1951–1963.
[93]
Robin Philipp, Zhijing Zhu, Julian Fuchs, Lukas Hartjen, Fabian Schuldt, and Falk Howar. 2021. Automated 3D object reference generation for the evaluation of autonomous vehicle perception. In Proceedings of the 5th International Conference on System Reliability and Safety (ICSRS’21). IEEE, 312–321.
[94]
Jinyang Shao. 2021. Testing object detection for autonomous driving systems via 3d reconstruction. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion’21). IEEE, 117–119.
[95]
Tongtong Bai, Yong Fan, Ya Pan, and Mingshuang Qing. 2021. Metamorphic testing for traffic light recognition in autonomous driving systems. In Proceedings of the IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C’21). IEEE, 38–44.
[96]
Zhi Quan Zhou and Liqun Sun. 2019. Metamorphic testing of driverless cars. Commun. ACM 62, 3 (2019), 61–67.
[97]
Trey Woodlief, Sebastian Elbaum, and Kevin Sullivan. 2022. Semantic image fuzzing of AI perception systems. In Proceedings of the 44th International Conference on Software Engineering. 1958–1969.
[98]
Xiangling Wang, Siqi Yang, Jinyang Shao, Jun Chang, Ge Gao, Ming Li, and Jifeng Xuan. 2021. Object removal for testing object detection in autonomous vehicle systems. In Proceedings of the IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C’21). IEEE, 543–549.
[99]
Manikandasriram Srinivasan Ramanagopal, Cyrus Anderson, Ram Vasudevan, and Matthew Johnson-Roberson. 2018. Failing to learn: Autonomously identifying perception failures for self-driving cars. IEEE Robot. Autom. Lett. 3, 4 (2018), 3860–3867.
[100]
Adel Dokhanchi, Heni Ben Amor, Jyotirmoy V. Deshmukh, and Georgios Fainekos. 2018. Evaluating perception systems for autonomous vehicles using quality temporal logic. In International Conference on Runtime Verification. Springer, 409–416.
[101]
Anand Balakrishnan, Jyotirmoy Deshmukh, Bardh Hoxha, Tomoya Yamaguchi, and Georgios Fainekos. 2021. PerceMon: Online monitoring for perception systems. In International Conference on Runtime Verification. Springer, 297–308.
[102]
Daniel Kondermann, Rahul Nair, Katrin Honauer, Karsten Krispin, Jonas Andrulis, Alexander Brock, Burkhard Gussefeld, Mohsen Rahimimoghaddam, Sabine Hofmann, Claus Brenner, et al. 2016. The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 19–28.
[103]
Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2446–2454.
[104]
Tsong Y. Chen, Shing C. Cheung, and Shiu Ming Yiu. 2020. Metamorphic testing: A new approach for generating next test cases. arXiv:2002.12543. Retrieved from https://arxiv.org/abs/2002.12543.
[105]
Amir Pnueli. 1977. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science (SFCS’77). IEEE, 46–57.
[106]
Ron Koymans. 1990. Specifying real-time properties with metric temporal logic. Real-time Syst. 2, 4 (1990), 255–299.
[107]
Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles. 1–18.
[108]
Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al. 2018. Deepgauge: Multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 120–131.
[109]
Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering (ICSE’19). IEEE, 1039–1049.
[110]
Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. ACM Comput. Surv. 43, 2 (2011), 1–29.
[111]
Christoph Gladisch, Christian Heinzemann, Martin Herrmann, and Matthias Woehrle. 2020. Leveraging combinatorial testing for safety-critical computer vision datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 324–325.
[112]
Chih-Hong Cheng, Chung-Hao Huang, and Hirotoshi Yasuoka. 2018. Quantitative projection coverage for testing ml-enabled autonomous systems. In International Symposium on Automated Technology for Verification and Analysis. Springer, 126–142.
[113]
Qin Xia, Jianli Duan, Feng Gao, Qiuxia Hu, and Yingdong He. 2018. Test scenario design for intelligent driving system ensuring coverage and effectiveness. Int. J. Autom. Technol. 19, 4 (2018), 751–758.
[114]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[115]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[116]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from https://arxiv.org/abs/1704.04861.
[117]
Kaggle InClass Challenge. Retrieved from https://www.kaggle.com/c/nyu-cv-fall-2018/data.
[118]
Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. 2019. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 770–779.
[119]
Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. 2019. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12697–12705.
[120]
Bin Yang, Wenjie Luo, and Raquel Urtasun. 2018. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7652–7660.
[121]
Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. 2020. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10529–10538.
[122]
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
[123]
Peiyun Hu, Jason Ziglar, David Held, and Deva Ramanan. 2020. What you see is what you get: Exploiting visibility for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11001–11009.
[124]
Andreas Mogelmose, Mohan Manubhai Trivedi, and Thomas B. Moeslund. 2012. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Trans. Intell. Transport. Syst. 13, 4 (2012), 1484–1497.
[125]
Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. 2012. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Netw. 32 (2012), 323–332.
[126]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[127]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556.
[128]
V. Yadav. p2-trafficsigns. Retrieved from https://github.com/vxy10/p2-TrafficSigns.
[129]
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning. PMLR, 1558–1566.
[130]
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.
[131]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[132]
Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. 2018. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV’18). 286–301.
[133]
Jiajun Lu, Hussein Sibai, Evan Fabry, and David Forsyth. 2017. No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv:1707.03501. Retrieved from https://arxiv.org/abs/1707.03501.
[134]
Shenao Yan, Guanhong Tao, Xuwei Liu, Juan Zhai, Shiqing Ma, Lei Xu, and Xiangyu Zhang. 2020. Correlations between deep neural network model coverage criteria and model quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 775–787.
[135]
Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is neuron coverage a meaningful measure for testing deep neural networks? In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 851–862.
[136]
Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems. ACM Trans. Softw. Eng. Methodol. 30, 2 (2021), 1–22.
[137]
Thomas Laurent, Paolo Arcaini, Fuyuki Ishikawa, and Anthony Ventresque. 2019. A mutation-based approach for assessing weight coverage of a path planner. In Proceedings of the 26th Asia-Pacific Software Engineering Conference (APSEC’19). IEEE, 94–101.
[138]
Thomas Laurent, Paolo Arcaini, Fuyuki Ishikawa, and Anthony Ventresque. 2020. Achieving weight coverage for an autonomous driving system with search-based test generation. In Proceedings of the 25th International Conference on Engineering of Complex Computer Systems (ICECCS’20). IEEE, 93–102.
[139]
Paolo Arcaini, Xiao-Yi Zhang, and Fuyuki Ishikawa. 2021. Targeting patterns of driving characteristics in testing autonomous driving systems. In Proceedings of the 14th IEEE Conference on Software Testing, Verification and Validation (ICST’21). IEEE, 295–305.
[140]
Yun Tang, Yuan Zhou, Fenghua Wu, Yang Liu, Jun Sun, Wuling Huang, and Gang Wang. 2021. Route coverage testing for autonomous vehicles via map modeling. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’21). IEEE, 11450–11456.
[141]
Paolo Arcaini, Xiao-Yi Zhang, and Fuyuki Ishikawa. 2022. Less is more: Simplification of test scenarios for autonomous driving system testing. In Proceedings of the IEEE Conference on Software Testing, Verification and Validation (ICST’22). IEEE, 279–290.
[142]
Matthias Althoff and Sebastian Lutz. 2018. Automatic generation of safety-critical test scenarios for collision avoidance of road vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’18). IEEE, 1326–1333.
[143]
Moritz Klischat and Matthias Althoff. 2019. Generating critical test scenarios for automated vehicles with evolutionary algorithms. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’19). IEEE, 2352–2358.
[144]
Stanley Bak, Johannes Betz, Abhinav Chawla, Hongrui Zheng, and Rahul Mangharam. 2022. Stress testing autonomous racing overtake maneuvers with RRT. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’22). IEEE, 806–812.
[145]
Maximilian Kahn, Atrisha Sarkar, and Krzysztof Czarnecki. 2022. I know you can’t see me: Dynamic occlusion-aware safety validation of strategic planners for autonomous vehicles using hypergames. In Proceedings of the International Conference on Robotics and Automation (ICRA’22). IEEE, 11202–11208.
[146]
Till Menzel, Gerrit Bagschik, and Markus Maurer. 2018. Scenarios for development, test and validation of automated vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’18). IEEE, 1821–1827.
[147]
Thomas Bäck and Hans-Paul Schwefel. 1993. An overview of evolutionary algorithms for parameter optimization. Evol. Comput. 1, 1 (1993), 1–23.
[148]
Alessandro Calò, Paolo Arcaini, Shaukat Ali, Florian Hauer, and Fuyuki Ishikawa. 2020. Generating avoidable collision scenarios for testing autonomous driving systems. In Proceedings of the IEEE 13th International Conference on Software Testing, Validation and Verification (ICST’20). IEEE, 375–386.
[149]
Alessandro Calò, Paolo Arcaini, Shaukat Ali, Florian Hauer, and Fuyuki Ishikawa. 2020. Simultaneously searching and solving multiple avoidable collisions for testing autonomous driving systems. In Proceedings of the Genetic and Evolutionary Computation Conference. 1055–1063.
[150]
Moritz Werling, Julius Ziegler, Sören Kammel, and Sebastian Thrun. 2010. Optimal trajectory generation for dynamic street scenarios in a Frenet frame. In Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 987–993.
[151]
Garazi Juez Uriagereka, Ray Lattarulo, Joshue Pérez Rastelli, Estibaliz Amparan Calonge, Alejandra Ruiz Lopez, and Huascar Espinoza Ortiz. 2017. Fault injection method for safety and controllability evaluation of automated driving. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’17). IEEE, 1867–1872.
[152]
Xugui Zhou, Anna Schmedding, Haotian Ren, Lishan Yang, Philip Schowitz, Evgenia Smirni, and Homa Alemzadeh. 2022. Strategic safety-critical attacks against an advanced driver assistance system. In Proceedings of the 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’22). IEEE, 79–87.
[153]
Xinpeng Wang, Yiqun Dong, Shaobing Xu, Huei Peng, Fucong Wang, and Zhenghao Liu. 2020. Behavioral competence tests for highly automated vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’20). IEEE, 1323–1329.
[154]
Xinpeng Wang, Huei Peng, Songan Zhang, and Kuan-Hui Lee. 2021. An interaction-aware evaluation method for highly automated vehicles. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC’21). IEEE, 394–401.
[155]
Zhenya Zhang, Gidon Ernst, Sean Sedwards, Paolo Arcaini, and Ichiro Hasuo. 2018. Two-layered falsification of hybrid systems guided by Monte Carlo tree search. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 37, 11 (2018), 2894–2905.
[156]
Zhenya Zhang, Deyun Lyu, Paolo Arcaini, Lei Ma, Ichiro Hasuo, and Jianjun Zhao. 2021. Effective hybrid system falsification using Monte Carlo tree search guided by QB-robustness. In International Conference on Computer Aided Verification. Springer, 595–618.
[157]
Yashwanth Annpureddy, Che Liu, Georgios Fainekos, and Sriram Sankaranarayanan. 2011. S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 254–257.
[158]
Alexandre Donzé. 2010. Breach, a toolbox for verification and parameter synthesis of hybrid systems. In International Conference on Computer Aided Verification. Springer, 167–170.
[159]
Zhenya Zhang, Deyun Lyu, Paolo Arcaini, Lei Ma, Ichiro Hasuo, and Jianjun Zhao. 2022. FalsifAI: Falsification of AI-enabled hybrid control systems guided by time-aware coverage criteria. IEEE Trans. Softw. Eng. (2022).
[160]
Jiayang Song, Deyun Lyu, Zhenya Zhang, Zhijie Wang, Tianyi Zhang, and Lei Ma. 2022. When cyber-physical systems meet AI: A benchmark, an evaluation, and a way forward. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. 343–352.
[161]
Gidon Ernst, Paolo Arcaini, Ismail Bennani, Aniruddh Chandratre, Alexandre Donzé, Georgios Fainekos, Goran Frehse, Khouloud Gaaloul, Jun Inoue, Tanmay Khandait, Logan Mathesen, Claudio Menghi, Giulia Pedrielli, Marc Pouzet, Masaki Waga, Shakiba Yaghoubi, Yoriyuki Yamagata, and Zhenya Zhang. 2021. ARCH-COMP 2021 category report: Falsification with validation of results. In Proceedings of the 8th International Workshop on Applied Verification of Continuous and Hybrid Systems (ARCH’21), EPiC Series in Computing, Goran Frehse and Matthias Althoff (Eds.), Vol. 80. EasyChair, 133–152.
[162]
Cumhur Erkan Tuncali, Theodore P. Pavlic, and Georgios Fainekos. 2016. Utilizing S-TaLiRo as an automatic test generation framework for autonomous vehicles. In Proceedings of the IEEE 19th International Conference on Intelligent Transportation Systems (ITSC’16). IEEE, 1470–1475.
[163]
Cumhur Erkan Tuncali and Georgios Fainekos. 2019. Rapidly-exploring random trees for testing automated vehicles. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 661–666.
[164]
Adel Djoudi, Loic Coquelin, and Rémi Regnier. 2020. A simulation-based framework for functional testing of automated driving controllers. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–6.
[165]
Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering. 303–314.
[166]
Zhong Li, Minxue Pan, Tian Zhang, and Xuandong Li. 2021. Testing DNN-based autonomous driving systems under critical environmental conditions. In International Conference on Machine Learning. PMLR, 6471–6482.
[167]
Husheng Zhou, Wei Li, Zelun Kong, Junfeng Guo, Yuqun Zhang, Bei Yu, Lingming Zhang, and Cong Liu. 2020. Deepbillboard: Systematic physical-world testing of autonomous driving systems. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering (ICSE’20). IEEE, 347–358.
[168]
Svetlana Pavlitskaya, Sefa Ünver, and J. Marius Zöllner. 2020. Feasibility and suppression of adversarial patch attacks on end-to-end vehicle control. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–8.
[169]
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[170]
Adith Boloor, Xin He, Christopher Gill, Yevgeniy Vorobeychik, and Xuan Zhang. 2019. Simple physical adversarial examples against end-to-end autonomous driving models. In Proceedings of the IEEE International Conference on Embedded Software and Systems (ICESS’19). IEEE, 1–7.
[171]
Adith Boloor, Karthik Garimella, Xin He, Christopher Gill, Yevgeniy Vorobeychik, and Xuan Zhang. 2020. Attacking vision-based perception in end-to-end autonomous driving models. J. Syst. Arch. 110 (2020), 101766.
[172]
Zelun Kong, Junfeng Guo, Ang Li, and Cong Liu. 2020. Physgan: Generating physical-world-resilient adversarial examples for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14254–14263.
[173]
Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE’18). IEEE, 132–142.
[174]
Ya Pan, Haiyang Ao, and Yong Fan. 2021. Metamorphic testing for autonomous driving systems in fog based on quantitative measurement. In Proceedings of the IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C’21). IEEE, 30–37.
[175]
Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. 2020. Misbehaviour prediction for autonomous driving systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 359–371.
[176]
Manzoor Hussain, Nazakat Ali, and Jang-Eui Hong. 2022. DeepGuard: A framework for safeguarding autonomous driving systems from inconsistent behaviour. Autom. Softw. Eng. 29, 1 (2022), 1–32.
[177]
Jaganmohan Chandrasekaran, Yu Lei, Raghu Kacker, and D. Richard Kuhn. 2021. A combinatorial approach to testing deep neural network-based autonomous driving systems. In Proceedings of the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW’21). IEEE, 57–66.
[181]
NVIDIA. Nvidia-Autopilot-Keras: End-to-End Learning for Self-driving Cars. Retrieved from https://github.com/0bserver07/Nvidia-Autopilot-Keras.
[182]
Alex Staravoitau. Behavioral cloning: End-to-end Learning for Self-driving Cars. Retrieved from https://github.com/navoshta/behavioral-cloning.
[183]
Jacob Gildenblat. 2016. Visualizations for Understanding the Regressed Wheel Steering Angle for Self Driving Cars. Retrieved from https://github.com/jacobgil/keras-steering-angle-visualizations.
[184]
Christian Hubschneider, Andre Bauer, Michael Weber, and J. Marius Zöllner. 2017. Adding navigation to the equation: Turning decisions for end-to-end vehicle control. In Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems (ITSC’17). IEEE, 1–8.
[186]
Team Rwightman. 2016. Steering Angle Model: Rwightman. Retrieved from https://github.com/udacity/self-driving-car/tree/master/steering-models/evaluation.
[187]
Tommaso Dreossi, Alexandre Donzé, and Sanjit A. Seshia. 2019. Compositional falsification of cyber-physical systems with machine learning components. J. Autom. Reason. 63, 4 (2019), 1031–1053.
[188]
Raja Ben Abdessalem, Annibale Panichella, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing autonomous cars for feature interaction failures using many-objective search. In Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE’18). IEEE, 143–154.
[189]
Alessio Gambi, Marc Mueller, and Gordon Fraser. 2019. Automatically testing self-driving cars with search-based procedural content generation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 318–328.
[190]
Vincenzo Riccio and Paolo Tonella. 2020. Model-based exploration of the frontier of behaviours for deep learning system testing. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 876–888.
[191]
Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi, and Paolo Tonella. 2021. Deephyperion: Exploring the feature space of deep learning-based systems through illumination search. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 79–90.
[192]
Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi, and Paolo Tonella. 2022. Efficient and effective feature space exploration for testing deep learning systems. ACM Trans. Softw. Eng. Methodol. (May 2022). Just Accepted.
[193]
Ezequiel Castellano, Ahmet Cetinkaya, and Paolo Arcaini. 2021. Analysis of road representations in search-based testing of autonomous driving systems. In Proceedings of the IEEE 21st International Conference on Software Quality, Reliability and Security (QRS’21). IEEE, 167–178.
[194]
Quentin Goss and Mustafa İlhan Akbaş. 2022. Eagle strategy with local search for scenario based validation of autonomous vehicles. In Proceedings of the International Conference on Connected Vehicle and Expo (ICCVE’22). IEEE, 1–6.
[195]
Xiaokun Zheng, Huawei Liang, Biao Yu, Bichun Li, Shaobo Wang, and Zhilei Chen. 2020. Rapid generation of challenging simulation scenarios for autonomous vehicles based on adversarial test. In Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA’20). IEEE, 1166–1172.
[196]
Guanpeng Li, Yiran Li, Saurabh Jha, Timothy Tsai, Michael Sullivan, Siva Kumar Sastry Hari, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2020. Av-fuzzer: Finding safety violations in autonomous driving systems. In Proceedings of the IEEE 31st International Symposium on Software Reliability Engineering (ISSRE’20). IEEE, 25–36.
[197]
Florian Hauer, Alexander Pretschner, and Bernd Holzmüller. 2019. Fitness functions for testing automated and autonomous driving systems. In International Conference on Computer Safety, Reliability, and Security. Springer, 69–84.
[198]
Nicola Kolb, Florian Hauer, and Alexander Pretschner. 2021. Fitness function templates for testing automated and autonomous driving systems in intersection scenarios. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC’21). IEEE, 217–222.
[199]
Yixing Luo, Xiao-Yi Zhang, Paolo Arcaini, Zhi Jin, Haiyan Zhao, Fuyuki Ishikawa, Rongxin Wu, and Tao Xie. 2021. Targeting requirements violations of autonomous driving systems by dynamic evolutionary search. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE’21). IEEE, 279–291.
[200]
Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella. 2022. Single and multi-objective test cases prioritization for self-driving cars in virtual environments. Proc. ACM Meas. Anal. Comput. Syst. 37, 4, Article 111.
[201]
Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective simulation-based test selection in self-driving cars software with SDC-Scissor. In Proceedings of the 29th IEEE International Conference on Software Analysis, Evolution, and Reengineering. IEEE, 164–168.
[202]
Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2016. Testing advanced driver assistance systems using multi-objective search and neural networks. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 63–74.
[203]
Felix Batsch, Alireza Daneshkhah, Vasile Palade, and Madeline Cheah. 2021. Scenario optimisation and sensitivity analysis for safe automated driving using Gaussian processes. Appl. Sci. 11, 2 (2021), 775.
[204]
Halil Beglerovic, Michael Stolz, and Martin Horn. 2017. Testing of autonomous vehicles using surrogate models and stochastic optimization. In Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems (ITSC’17). IEEE, 1–6.
[205]
Jian Sun, Huajun Zhou, He Zhang, Ye Tian, and Qinghui Ji. 2020. Adaptive design of experiments for accelerated safety evaluation of automated vehicles. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–7.
[206]
Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. arXiv:1504.04909. Retrieved from https://arxiv.org/abs/1504.04909.
[207]
Voyage. Deepdrive. Retrieved from https://deepdrive.io/.
[208]
Ezequiel Castellano, Ahmet Cetinkaya, Cédric Ho Thanh, Stefan Klikovits, Xiaoyi Zhang, and Paolo Arcaini. 2021. Frenetic at the SBST 2021 tool competition. In Proceedings of the IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST’21). IEEE, 36–37.
[209]
Alessio Gambi, Gunel Jahangirova, Vincenzo Riccio, and Fiorella Zampetti. 2022. SBST tool competition 2022. In Proceedings of the IEEE/ACM 15th International Workshop on Search-Based Software Testing (SBST’22). IEEE, 25–32.
[210]
Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018. Testing vision-based control systems using learnable evolutionary algorithms. In Proceedings of the IEEE/ACM 40th International Conference on Software Engineering (ICSE’18). IEEE, 1016–1026.
[211]
Mark Koren, Saud Alsaif, Ritchie Lee, and Mykel J. Kochenderfer. 2018. Adaptive stress testing for autonomous vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’18). IEEE, 1–7.
[212]
Anthony Corso, Peter Du, Katherine Driggs-Campbell, and Mykel J. Kochenderfer. 2019. Adaptive stress testing with reward augmentation for autonomous vehicle validation. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 163–168.
[213]
Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. 2017. On a formal model of safe and scalable self-driving cars. arXiv:1708.06374. Retrieved from https://arxiv.org/abs/1708.06374.
[214]
Daniel Baumann, Raphael Pfeffer, and Eric Sax. 2021. Automatic generation of critical test cases for the development of highly automated driving functions. In Proceedings of the IEEE 93rd Vehicular Technology Conference (VTC’21). IEEE, 1–5.
[215]
Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Mach. Learn. 8, 3 (1992), 279–292.
[216]
Dhanoop Karunakaran, Stewart Worrall, and Eduardo Nebot. 2020. Efficient statistical validation with edge cases to evaluate Highly Automated Vehicles. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–8.
[217]
Yasuhiro Akagi, Ryosuke Kato, Sou Kitajima, Jacobo Antona-Makoshi, and Nobuyuki Uchida. 2019. A risk-index based sampling method to generate scenarios for the evaluation of automated driving vehicle safety. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 667–672.
[218]
Philippe Nitsche, Ruth Helen Welsh, Alexander Genser, and Pete D. Thomas. 2018. A novel, modular validation framework for collision avoidance of automated vehicles at road junctions. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC’18). IEEE, 90–97.
[219]
Anthony Corso and Mykel J. Kochenderfer. 2020. Interpretable safety validation for autonomous vehicles. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–6.
[220]
Anthony Corso, Ritchie Lee, and Mykel J. Kochenderfer. 2020. Scalable autonomous vehicle safety validation through dynamic programming and scene decomposition. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–6.
[221]
Felix Batsch, Alireza Daneshkhah, Madeline Cheah, Stratis Kanarachos, and Anthony Baxendale. 2019. Performance boundary identification for the evaluation of automated vehicles using Gaussian process classification. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 419–424.
[223]
Barbara Schütt, Marc Heinrich, Sonja Marahrens, J. Marius Zöllner, and Eric Sax. 2022. An application of scenario exploration to find new scenarios for the development and testing of automated driving systems in urban scenarios. In Proceedings of the 8th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS’22), Jeroen Ploeg, Markus Helfert, Karsten Berns, and Oleg Gusikhin (Eds.). SCITEPRESS, 338–345.
[224]
Lukas Birkemeyer, Tobias Pett, Andreas Vogelsang, Christoph Seidl, and Ina Schaefer. 2022. Feature-interaction sampling for scenario-based testing of advanced driver assistance systems. In Proceedings of the 16th International Working Conference on Variability Modelling of Software-Intensive Systems. 1–10.
[225]
Surya T. Tokdar and Robert E. Kass. 2010. Importance sampling: A review. Comput. Stat. 2, 1 (2010), 54–60.
[226]
Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, and John C. Duchi. 2018. Scalable end-to-end autonomous vehicle testing via rare-event simulation. Adv. Neural Inf. Process. Syst. 31 (2018), 9849–9860.
[227]
Justin Norden, Matthew O’Kelly, and Aman Sinha. 2019. Efficient black-box assessment of autonomous vehicle safety. arXiv:1912.03618. Retrieved from https://arxiv.org/abs/1912.03618.
[228]
Ding Zhao, Henry Lam, Huei Peng, Shan Bao, David J. LeBlanc, Kazutoshi Nobukawa, and Christopher S. Pan. 2016. Accelerated evaluation of automated vehicles safety in lane-change scenarios based on importance sampling techniques. IEEE Trans. Intell. Transport. Syst. 18, 3 (2016), 595–607.
[229]
Zhiyuan Huang, Henry Lam, David J. LeBlanc, and Ding Zhao. 2017. Accelerated evaluation of automated vehicles using piecewise mixture models. IEEE Trans. Intell. Transport. Syst. 19, 9 (2017), 2845–2855.
[230]
Zhiyuan Huang, Henry Lam, and Ding Zhao. 2017. An accelerated testing approach for automated vehicles with background traffic described by joint distributions. In Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems (ITSC’17). IEEE, 933–938.
[231]
Ding Zhao, Xianan Huang, Huei Peng, Henry Lam, and David J. LeBlanc. 2017. Accelerated evaluation of automated vehicles in car-following maneuvers. IEEE Trans. Intell. Transport. Syst. 19, 3 (2017), 733–744.
[232]
Zhiyuan Huang, Yaohui Guo, Mansur Arief, Henry Lam, and Ding Zhao. 2018. A versatile approach to evaluating and testing automated vehicles based on kernel methods. In Proceedings of the Annual American Control Conference (ACC’18). IEEE, 4796–4802.
[233]
Zhiyuan Huang, Ding Zhao, Henry Lam, David J. LeBlanc, and Huei Peng. 2017. Evaluation of automated vehicles in the frontal cut-in scenario–An enhanced approach using piecewise mixture models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’17). IEEE, 197–202.
[234]
Xinpeng Wang, Huei Peng, and Ding Zhao. 2021. Combining reachability analysis and importance sampling for accelerated evaluation of highway automated vehicles at pedestrian crossing. ASME Lett. Dynam. Syst. Contr. 1, 1 (2021).
[235]
Takami Sato, Junjie Shen, Ningfei Wang, Yunhan Jia, Xue Lin, and Qi Alfred Chen. 2021. Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In Proceedings of the 30th USENIX Security Symposium (USENIX Security’21). 3309–3326.
[236]
Abu Hasnat Mohammad Rubaiyat, Yongming Qin, and Homa Alemzadeh. 2018. Experimental resilience assessment of an open-source driving agent. In Proceedings of the IEEE 23rd Pacific Rim International Symposium on Dependable Computing (PRDC’18). IEEE, 54–63.
[237]
Ben Nassi, Jacob Shams, Raz Ben-Netanel, and Yuval Elovici. 2022. bAdvertisement: Attacking advanced driver-assistance systems using print advertisements. In Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P’22). IEEE, 376–383.
[238]
Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. 2021. Advsim: Generating safety-critical scenarios for self-driving vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9909–9918.
[239]
S. M. Sohel Mahmud, Luis Ferreira, Md Shamsul Hoque, and Ahmad Tavassoli. 2017. Application of proximal surrogate indicators for safety evaluation: A review of recent developments and research needs. IATSS Res. 41, 4 (2017), 153–163.
[240]
Lukas Westhofen, Christian Neurohr, Tjark Koopmann, Martin Butz, Barbara Schütt, Fabian Utesch, Birte Neurohr, Christian Gutenkunst, and Eckard Böde. 2022. Criticality metrics for automated driving: A review and suitability analysis of the state of the art. Archives of Computational Methods in Engineering (2022), 1–35.
[241]
Hiroshi Wakabayashi, Yoshihiko Takahashi, Shigehiro Niimi, and Kazumi Renge. 2003. Traffic conflict analysis using vehicle tracking system/digital vcr and proposal of a new conflict indicator. Infrastruct. Plan. Rev. 20 (2003), 949–956.
[242]
Bowen Weng, Sughosh J. Rao, Eeshan Deosthale, Scott Schnelle, and Frank Barickman. 2020. Model predictive instantaneous safety metric for evaluation of automated driving systems. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’20). IEEE, 1899–1906.
[243]
Andreas Tamke, Thao Dang, and Gabi Breuel. 2011. A flexible method for criticality assessment in driver assistance systems. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’11). IEEE, 697–702.
[244]
Sverker Almqvist, Christer Hyden, and Ralf Risser. 1991. Use of speed limiters in cars for increased safety and a better environment. Transport. Res. Rec. 1318 (1991), 34–39.
[245]
John R. McLean and Errol R. Hoffmann. 1971. Analysis of drivers’ control movements. Hum. Factors 13, 5 (1971), 407–418.
[246]
Li Li, Wu-Ling Huang, Yuehu Liu, Nan-Ning Zheng, and Fei-Yue Wang. 2016. Intelligence testing for autonomous vehicles: A new approach. IEEE Trans. Intell. Vehic. 1, 2 (2016), 158–166.
[247]
Cumhur Erkan Tuncali, Georgios Fainekos, Hisahiro Ito, and James Kapinski. 2018. Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’18). IEEE, 1555–1562.
[248]
Cumhur Erkan Tuncali, Georgios Fainekos, Danil Prokhorov, Hisahiro Ito, and James Kapinski. 2019. Requirements-driven test generation for autonomous vehicles with machine learning components. IEEE Trans. Intell. Vehic. 5, 2 (2019), 265–280.
[249]
Qingzhao Zhang, David Ke Hong, Ze Zhang, Qi Alfred Chen, Scott Mahlke, and Z. Morley Mao. 2021. A systematic framework to identify violations of scenario-dependent driving rules in autonomous vehicle software. Proc. ACM Meas. Anal. Comput. Syst. 5, 2 (2021), 1–25.
[250]
Jia Cheng Han and Zhi Quan Zhou. 2020. Metamorphic fuzz testing of autonomous vehicles. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops. 380–385.
[251]
Florian Hauer, Tabea Schmidt, Bernd Holzmüller, and Alexander Pretschner. 2019. Did we test all scenarios for automated and autonomous driving systems? In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC’19). IEEE, 2950–2955.
[252]
Yun Tang, Yuan Zhou, Yang Liu, Jun Sun, and Gang Wang. 2021. Collision avoidance testing for autonomous driving systems on complete maps. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’21). IEEE, 179–185.
[253]
Jonas Kerber, Sebastian Wagner, Korbinian Groh, Dominik Notz, Thomas Kühbeck, Daniel Watzenig, and Alois Knoll. 2020. Clustering of the scenario space for the assessment of automated driving. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’20). IEEE, 578–583.
[254]
István Majzik, Oszkár Semeráth, Csaba Hajdu, Kristóf Marussy, Zoltán Szatmári, Zoltán Micskei, András Vörös, Aren A. Babikian, and Dániel Varró. 2019. Towards system-level testing with coverage guarantees for autonomous vehicles. In Proceedings of the ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS’19). IEEE, 89–94.
[255]
Zaid Tahir and Rob Alexander. 2022. Intersection focused situation coverage-based verification and validation framework for autonomous vehicles implemented in CARLA. In International Conference on Modelling and Simulation for Autonomous Systems. Springer, 191–212.
[256]
Peng Guo and Feng Gao. 2020. Automated scenario generation and evaluation strategy for automatic driving system. In Proceedings of the 7th International Conference on Information Science and Control Engineering (ICISCE’20). IEEE, 1722–1733.
[257]
Hong Shu, Haoran Lv, Kang Liu, Kang Yuan, and Xiaolin Tang. 2021. Test scenarios construction based on combinatorial testing strategy for automated vehicles. IEEE Access 9 (2021), 115019–115029.
[258]
Yihao Li, Jianbo Tao, and Franz Wotawa. 2020. Ontology-based test generation for automated and autonomous driving functions. Inf. Softw. Technol. 117 (2020), 106200.
[259]
Changwen Li, Chih-Hong Cheng, Tiantian Sun, Yuhang Chen, and Rongjie Yan. 2022. ComOpT: Combination and optimization for testing autonomous driving systems. In Proceedings of the International Conference on Robotics and Automation (ICRA’22). IEEE, 7738–7744.
[260]
Elias Rocklage, Heiko Kraft, Abdullah Karatas, and Jörg Seewig. 2017. Automated scenario generation for regression testing of autonomous vehicles. In Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems (ITSC’17). IEEE, 476–483.
[261]
Taeyoung Lee, Kyongsu Yi, Jangseop Kim, and Jaewan Lee. 2011. Development and evaluations of advanced emergency braking system algorithm for the commercial vehicle. In Proceedings of the 22nd International Technical Conference on the Enhanced Safety of Vehicles (ESV’11).
[262]
Udacity. Udacity’s Self-Driving Car Simulator. Retrieved from https://github.com/udacity/self-driving-car-sim.
[263]
International Electronics & Engineering. Retrieved from https://www.iee.lu/.
[264]
Martin Treiber, Ansgar Hennecke, and Dirk Helbing. 2000. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 62, 2 (2000), 1805.
[265]
Uwe Kiencke and Lars Nielsen. 2000. Automotive Control Systems: For Engine, Driveline and Vehicle. Meas. Sci. Technol. 11 (2000), 1828.
[266]
Abbas Sadat, Mengye Ren, Andrei Pokrovsky, Yen-Chen Lin, Ersin Yumer, and Raquel Urtasun. 2019. Jointly learnable behavior and trajectory planning for self-driving vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’19). IEEE, 3949–3956.
[267]
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. 2019. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8660–8669.
[268]
Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. 2020. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In European Conference on Computer Vision. Springer, 414–430.
[269]
Ritchie Lee, Mykel J. Kochenderfer, Ole J. Mengshoel, Guillaume P. Brat, and Michael P. Owen. 2015. Adaptive stress testing of airborne collision avoidance systems. In Proceedings of the IEEE/AIAA 34th Digital Avionics Systems Conference (DASC’15). IEEE, 1–13.
[270]
Yu Chen, Shitao Chen, Tangyike Zhang, Songyi Zhang, and Nanning Zheng. 2018. Autonomous vehicle testing and validation platform: Integrated simulation system with hardware in the loop. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’18). IEEE, 949–956.
[271]
Shitao Chen, Yu Chen, Songyi Zhang, and Nanning Zheng. 2019. A novel integrated simulation and testing platform for self-driving cars with hardware in the loop. IEEE Trans. Intell. Vehic. 4, 3 (2019), 425–436.
[272]
Craig Brogle, Chao Zhang, Kai Li Lim, and Thomas Bräunl. 2019. Hardware-in-the-loop autonomous driving simulation without real-time constraints. IEEE Trans. Intell. Vehic. 4, 3 (2019), 375–384.
[273]
Ying Gao, Zhigang Xu, Xiangmo Zhao, Guiping Wang, and Quan Yuan. 2020. Hardware-in-the-loop simulation platform for autonomous vehicle AEB prototyping and validation. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC’20). IEEE, 1–6.
[274]
OpenStreetMap contributors. OpenStreetMap. Retrieved from http://www.openstreetmap.org/.
[275]
Yu Chen, Shitao Chen, Tong Xiao, Songyi Zhang, Qian Hou, and Nanning Zheng. 2020. Mixed test environment-based vehicle-in-the-loop validation-a new testing approach for autonomous vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’20). IEEE, 1283–1289.
[276]
Tamás Tettamanti, Mátyás Szalai, Sándor Vass, and Viktor Tihanyi. 2018. Vehicle-in-the-loop test environment for autonomous driving with microscopic traffic simulation. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety (ICVES’18). IEEE, 1–6.
[277]
Selim Solmaz, Martin Rudigier, and Marlies Mischinger. 2020. A vehicle-in-the-loop methodology for evaluating automated driving functions in virtual traffic. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’20). IEEE, 1465–1471.
[278]
Hexuan Li, Demin Nalic, Vamsi Makkapati, Arno Eichberger, Xuan Fang, and Tamás Tettamanti. 2021. A real-time co-simulation framework for virtual test and validation on a high dynamic vehicle test bed. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV’21). IEEE, 1132–1137.
[279]
Andrea Stocco, Brian Pulfer, and Paolo Tonella. 2022. Mind the gap! A study on the transferability of virtual vs physical-world testing of autonomous driving systems. IEEE Trans. Softw. Eng. (2022).
[280]
Daniel Krajzewicz, Georg Hertkorn, Christian Rössel, and Peter Wagner. 2002. SUMO (simulation of urban mobility)-an open-source traffic simulation. In Proceedings of the 4th Middle East Symposium on Simulation and Modelling (MESM’02). 183–187.
[282]
Donkey Car. 2021. Retrieved from https://www.donkeycar.com/.
[283]
Zsolt Szalay, Mátyás Szalai, Bálint Tóth, Tamás Tettamanti, and Viktor Tihanyi. 2019. Proof of concept for scenario-in-the-loop (SciL) testing for autonomous vehicle technology. In Proceedings of the IEEE International Conference on Connected Vehicles and Expo (ICCVE’19). IEEE, 1–5.
[284]
Unity. Unity. Retrieved from https://unity.com/.
[285]
Mátyás Szalai, Balázs Varga, Tamás Tettamanti, and Viktor Tihanyi. 2020. Mixed reality test environment for autonomous cars using Unity 3D and SUMO. In Proceedings of the IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI’20). IEEE, 73–78.
[286]
Márton Tamás Horváth, Qiong Lu, Tamás Tettamanti, Árpád Török, and Zsolt Szalay. 2019. Vehicle-in-the-loop (VIL) and scenario-in-the-loop (SCIL) automotive simulation concepts from the perspectives of traffic simulation and traffic control. Transport Telecommun. 20, 2 (2019), 153–161.
[287]
Nidhi Kalra and Susan M. Paddock. 2016. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transport. Res. A: Policy Pract. 94 (2016), 182–193.
[288]