1 Introduction
The advancement in artificial intelligence (AI) and computer vision (CV) is driving cyber-physical systems (CPS) towards autonomy, e.g., self-driving cars, drones, and robots. Robot Operating System 2 (ROS2) is becoming pivotal in this drive due to (i) its vast repertoire of open-source implementations of AI and CV algorithms, (ii) its portability to different operating systems (OS), and (iii) its support for collaborative development of complex software applications.
Software applications over ROS2 consist of modular functional components called
nodes. A node comprises
callbacks to handle timer and communication events. The nodes send and receive data using
topics following Data Distribution Service (DDS) standard. An application can contain
computation chains formed by a sequential invocation of callbacks connected via topics. In autonomous systems, a chain reads and processes sensor data, performs planning, executes control logic, and applies actuation signals [20].
A long and unpredictable end-to-end latency of such a chain can lead to lower performance or even unsafe physical behavior [31, 39]. Static and dynamic priority-based scheduling have been studied in ROS2 to reduce the end-to-end latency, e.g., [3, 9, 15, 50]. However, none of the existing works considers minimizing the jitters, i.e., variations in the end-to-end latency, of ROS2 chains.
Typically, jitters are eliminated using techniques [7, 18, 21, 22, 33, 34] following the Logical Execution Time (LET) concept [25]—the de-facto industry standard, e.g., see AUTOSAR [4]. Figure 1a shows a conventional LET implementation where, at the beginning of a period, a task first writes its output from the previous execution, then reads the new input and starts processing it. Hence, a task spends a constant time—equal to its period—between data read and write. Also, a chain has a constant end-to-end latency. Nevertheless, existing LET implementations are not compatible with ROS2 chains for the following reasons: (i) They require controlling the timing of reading from and writing to topics, which is non-trivial for the more popular DDS communication mechanism. (ii) They require the callbacks to be time-triggered, which is not the case when one subscribes to a topic.
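To make the LET timing concrete, consider the following toy sketch (our illustration, not an implementation from the literature): under LET, a chain of n tasks with a common period has a constant end-to-end latency of n periods, independent of the jobs' actual execution times. All function and parameter names are hypothetical.

```python
# Toy model of conventional LET semantics: each task reads its input and
# (logically) publishes its output exactly one period apart, so latency is
# decoupled from execution time as long as every job meets its period.
def let_chain_latency(num_tasks, period, exec_times):
    for e in exec_times:
        # LET assumes each job finishes within its period.
        assert e <= period, "job overruns its period"
    # One period per task between data read and data write.
    return num_tasks * period

# A 3-task chain with a 100 ms period: the latency is 300 ms for every
# sample, whether the jobs run for 5 ms or 95 ms -> zero jitter.
print(let_chain_latency(3, 100, [5, 50, 95]))   # 300
```

This constant one-period-per-task delay is exactly the pessimistic assumption that our approach relaxes.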
Proposed solution: Towards a practical LET implementation of a ROS2 chain, we use table-driven reservation servers. Such a server reserves specific time slots on a processing unit to run its assigned threads. We execute a high-critical (sub-)chain using high-priority servers and run best-effort workloads in low-priority servers, i.e., we consider hierarchical scheduling. This offers high real-time performance as well as good processor utilization. Unlike conventional time-triggered scheduling, where computation time is reserved for each task based on its worst-case execution time (WCET), we assign chained callbacks to a server Rm capable of running multiple threads. We dimension Rm based on the maximum measured end-to-end latency \(\overline{L}\) of the assigned chain during a priority-driven execution. Such measurement-based approaches are popular in the industry [2].
Further, we study two DDS communication mechanisms. (1) When asynchronous publish is used, one OS thread executes the computation logic while another publishes the data. (2) Both tasks are performed by the same thread using synchronous publish.
When the chain’s output is published asynchronously, we assign the corresponding publisher thread to a server \(R_{m^\circ }\) with a time slot immediately following Rm in time. Such a two-server scheduling of the chain enables latency shaping, i.e., it minimizes jitters while preserving \(\overline{L}\). Figure 1b depicts this behavior, where we get a constant end-to-end latency \(\approx \overline{L}\). It also shows that we relax the pessimistic LET assumption of one period per task’s response.
With the more popular synchronous publish mechanism, we face a technical challenge when the chain’s last callback has a large execution time variation while it is run by the thread that also publishes the chain’s output. We solve this by architecturally extending the chain with a callback that republishes the chain’s output and by running this callback in \(R_{m^\circ }\). We perform this chain extension by exploiting ROS2-supported modular software development, i.e., without any modification or recompilation of the application source code. In fact, we provide a complete tool chain automating each step from profiling to running the chain using reservation servers.
\(R_{m^\circ }\) can also be configured with additional time slots, offering a multi-latency implementation of the chain. Figure 1c illustrates that the chain’s end-to-end latency falls in two distinct time bands when we appropriately configure \(R_{m^\circ }\) with two time slots per cycle. Such an implementation encourages multi-mode application design where, in each mode, the chain has a specific latency. This is particularly useful for autonomous systems running in different environments. Further, we apply particle swarm optimization [26] to place the time slots and try to minimize the average end-to-end latency—a performance measure commonly used in the industry.
Contributions: Our main contributions are as follows:
•
We introduce the concept of latency shaping to minimize jitters in a ROS2 computation chain with a negligible impact on the observed maximum end-to-end latency. We also demonstrate its versatility in supporting multi-mode chain operation without the need for dynamic schedule reconfiguration, which is crucial for next-generation autonomous systems.
•
We propose two chain-aware implementations of latency shaping that are compatible with ROS2 semantics.
•
We develop tools that will automatically (i) determine configurations for reservation servers and (ii) create and configure them as well as assign chain components to run inside them. In essence, we enable design automation for latency shaping.
•
We apply latency shaping on a real-world benchmark, i.e., for a chain from Lidar to vehicle pose estimation in Autoware’s Autonomous Valet Parking (AVP) [20].
•
We perform experiments to demonstrate that our proposed concepts can be applied (i) to implement a ROS2 chain following the conventional LET concept and (ii) to directed acyclic graphs (DAGs) comprising ROS2 callbacks.
Paper organization: Section 2 provides the system model and briefly discusses our profiling-based timing model extraction tool. Using a synthetic ROS2 chain, it further compares priority-driven and time-triggered scheduling in terms of maximum end-to-end latency and jitters. In Section 3, we describe our proposed mechanism to manage the end-to-end latency of a ROS2 chain and its different implementations. Section 4 outlines (i) our problem formulation to place time slots for multi-band latency shaping and (ii) a tool chain to automate latency shaping. We present different case studies in Section 5. We discuss certain complex scenarios for future consideration in Section 6. We discuss the related work in Section 7 and provide concluding remarks in Section 8.
2 Background and System Model
2.1 Modeling ROS2 applications as DAGs
ROS2 is a multi-layer middleware [
35] that provides easy-to-use application program interfaces (APIs) to develop complex software applications. A major advantage is that it enables independent development of software modules that can be later easily composed together to run coherently even on a distributed platform.
In ROS2, a standalone software module that implements a particular functionality, e.g., object detection, is called a node. The main building blocks of a ROS2 node are event-handling callbacks. There are four types of callbacks, namely timer, subscriber, service, and client callbacks. Timer callbacks are triggered by periodic timer events. Communication between nodes is carried out via topics. When a node subscribes to a topic, new data on the topic triggers the designated subscriber callback to handle it. ROS2 offers a feature called service for blocking remote procedure calls (RPCs). Communications related to services are carried out using request and reply topics. A caller writes input arguments on a request topic to invoke the corresponding service callback. After processing the input, the service callback writes its output on the reply topic, which triggers the client callback.
Consider that a set of η nodes {Ni | 1 ≤ i ≤ η} implements the software applications over ROS2. Each node Ni is further composed of a set Γi of γi callbacks, Γi = {cbi, j | 1 ≤ j ≤ γi}. We can model the applications using a DAG \(\mathbb {G}\) where each callback cbi, j is a vertex in \(\mathbb {G}\). That is, the set of vertices \(\mathbb {V}\) in \(\mathbb {G}\) is given by Γ1 ∪ Γ2 ∪ … ∪ Γη. We draw an edge from cbi, j to \(cb_{i^{\prime },j^{\prime }}\) when cbi, j publishes on a topic T, i.e., \(T \in cb_{i,j}[\mathbb {PT}]\), while \(cb_{i^{\prime },j^{\prime }}\) subscribes to T, i.e., \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\). Further, cbi, j can have multiple outgoing edges when (i) \(T \in cb_{i,j}[\mathbb {PT}]\) has more than one subscriber and/or (ii) cbi, j publishes different parts of its result on multiple topics, i.e., \(|cb_{i,j}[\mathbb {PT}]| \gt 1\). In the same vein, \(cb_{i^{\prime },j^{\prime }}\) can have more than one incoming edge when (i) multiple callbacks publish on \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\) and/or (ii) \(cb_{i^{\prime },j^{\prime }}\) subscribes to two or more topics, i.e., \(|cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]| \gt 1\). For the latter case, we have seen an implementation of synchronizing callbacks using message filters in Autoware [49]. Such a callback \(cb_{i^{\prime },j^{\prime }}\) executes its logic only when each of the subscribed topics has new data. Further, we note that timer callbacks have no incoming edges because they do not subscribe to any topic. Other callbacks (i.e., subscriber, service, and client callbacks) can have both incoming and outgoing edges.
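As an illustration of the edge rule above (our sketch, not part of any tool described in this paper), the DAG can be derived directly from the callbacks’ published and subscribed topic sets. Callback and topic names are hypothetical.

```python
# Build the application DAG: draw an edge cb -> cb' whenever cb publishes on
# a topic that cb' subscribes to (shared element of PT(cb) and ST(cb')).
def build_dag(callbacks):
    """callbacks: dict name -> {'PT': published topics, 'ST': subscribed topics}."""
    return {(u, v)
            for u, cu in callbacks.items()
            for v, cv in callbacks.items()
            if cu['PT'] & cv['ST']}

cbs = {
    'cb_timer':  {'PT': {'/scan'},   'ST': set()},        # timer: no incoming edge
    'cb_filter': {'PT': {'/scan_f'}, 'ST': {'/scan'}},
    'cb_pose':   {'PT': {'/pose'},   'ST': {'/scan_f'}},
    'cb_log':    {'PT': set(),       'ST': {'/scan_f'}},  # second subscriber of /scan_f
}
edges = build_dag(cbs)
# '/scan_f' has two subscribers, so 'cb_filter' gets two outgoing edges.
```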
2.2 Scheduler and thread models
We study the single-threaded executor in ROS2, which is the one mostly studied in the literature [13, 15, 44] and is also used in our benchmark application from Autoware. Such an executor runs all callbacks in a node Ni based on a certain scheduling policy using only one thread \(N_i[\mathcal {T}_{ex}]\). While it is possible to customize the scheduling inside executors [3, 15], our experiments only consider the default policy, which is described in [13, 36]. This paper does not evaluate how different executor implementations influence application timings. However, we believe that our techniques can be trivially applied to all variations of single-threaded executors.
Further, the threads are scheduled by the OS. Here, we have studied two scheduling policies. (i) \(\mathbb {S}_1\): fixed-priority preemptive scheduling provided by SCHED_FIFO in (Preempt_RT-patched) Linux. When using \(\mathbb {S}_1\), we assign a priority \(\mathcal {T}[Pr]\) to a thread \(\mathcal {T}\); e.g., \(N_i[\mathcal {T}_{ex}][Pr]\) gives the priority of the executor thread \(N_i[\mathcal {T}_{ex}]\) of node Ni. Here, the larger the value of \(\mathcal {T}[Pr]\), the higher the priority of \(\mathcal {T}\) to run. (ii) \(\mathbb {S}_2\): table-driven reservation-based scheduling provided by LITMUSRT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems, a Linux kernel extension) [10]. When using \(\mathbb {S}_2\), we allocate a thread \(\mathcal {T}\) to a reservation server \(\mathcal {T}[RS]\); e.g., \(N_i[\mathcal {T}_{ex}][RS]\) denotes the server that runs \(N_i[\mathcal {T}_{ex}]\). Details about reservation servers are provided later in this section.
Overall, we see a hierarchical scheduling of ROS2 callbacks. That is, the OS scheduler selects an executor thread of a particular node to run, while the executor then runs one of the node’s ready callbacks. Besides the executor thread, a set of threads, \(N_i[\mathbb {T}_{DDS}]\), is also spawned by the DDS layer for a node Ni to manage both intra- and inter-node communication [16, 19, 41]. In this paper, we study two mechanisms to publish data on a topic [17]. (i) Asynchronous publish: When a callback cbi, j writes its output on the topic \(T \in cb_{i,j}[\mathbb {PT}]\) using this mechanism, a publisher thread \(N_i[\mathcal {T}_{pub}] \in N_i[\mathbb {T}_{DDS}]\) is woken up, which then executes the task of publishing the data to its subscribers. (ii) Synchronous publish: Using this mechanism, a callback cbi, j directly publishes its output on \(T \in cb_{i,j}[\mathbb {PT}]\), i.e., the executor thread \(N_i[\mathcal {T}_{ex}]\) runs the logic in cbi, j as well as the task of publishing its output to the subscribers.
2.2.1 Table-driven reservation-based scheduling.
In this scheduling environment, threads run inside reservation servers. A server Rm is defined using a set of time slots, \(R_m[Sl] = \lbrace sl_{m,1}, sl_{m,2}, \ldots , sl_{m,\nu _m}\rbrace\), and a cycle time, Rm[CT], that are statically configured. Each slot slm, n has a start time stm, n and an end time etm, n and it repeats every Rm[CT] time units. A thread \(\mathcal {T}\) assigned to \(\mathcal {T}[RS] = R_m\) can only run inside Rm’s time slots. Besides enabling timing determinism, a server constrains the amount of time for which its threads can run on the processor, thereby providing temporal isolation between different sets of threads.
We use LITMUSRT to perform our experiments. It provides real-time schedulers and synchronization mechanisms. Its plugin P-RES implements partitioned reservation-based scheduling and allows defining and using table-driven reservation servers. It allows one to (i) assign a server Rm to a processor core Rm[PC] and (ii) map a set of threads \(R_m[\mathbb {T}]\) to it. If multiple threads are ready to run in a server, a round-robin scheduling policy allocates processor time to them. Further, a priority Rm[Pr] can be assigned to a server Rm; in the LITMUSRT implementation, a smaller value of Rm[Pr] implies a higher priority. Priorities are useful for isolating high-critical applications from best-effort applications. That is, simultaneous to a server Rm running time-sensitive threads, we can define another server \(R_{m^{\prime }}\) with overlapping time slots and \(R_{m^{\prime }}[Pr] \gt R_{m}[Pr]\) to run best-effort threads. In this case, \(R_{m^{\prime }}\) can run its threads only when there are no ready-to-run or running threads in Rm.
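The slot and priority semantics described above can be sketched as follows (a simplified model of our own, not LITMUSRT code); slot boundaries, cycle times, and server names are hypothetical.

```python
# A server may run only when the current time, taken modulo its cycle time,
# falls inside one of its statically configured slots.
def server_active(server, t):
    phase = t % server['CT']
    return any(st <= phase < et for st, et in server['slots'])

# Among active servers with ready threads, the numerically smallest priority
# value wins (LITMUS^RT convention: smaller value = higher priority).
def runnable(servers, t):
    active = [s for s in servers if server_active(s, t) and s['ready']]
    return min(active, key=lambda s: s['Pr'])['name'] if active else None

R_hi = {'name': 'R_hi', 'slots': [(0, 80)],  'CT': 100, 'Pr': 1, 'ready': True}
R_be = {'name': 'R_be', 'slots': [(0, 100)], 'CT': 100, 'Pr': 2, 'ready': True}
# R_be fully overlaps R_hi but only runs when R_hi has no ready threads.
```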
This paper promotes table-driven reservation-based scheduling in LITMUSRT because it offers both timing determinism and good processor utilization for mixed-criticality workloads.
2.3 Modeling ROS2 computation chains
While it is clear that ROS2 applications can form complex DAGs, we note that this work does not promise to efficiently schedule a whole DAG. Instead, it focuses on scheduling a jitter-sensitive critical chain of ROS2 callbacks. We define a computation chain chk as an ordered sequence (chk[1], ⋅⋅⋅, chk[l], chk[l + 1], ⋅⋅⋅, chk[μk]) of μk callbacks—\(ch_k[l] \in \mathbb {V}\) is a callback—connected via topics. That is, there is a directed edge from chk[l] to chk[l + 1] in \(\mathbb {G}\), i.e., \(ch_k[l][\mathbb {PT}] \cap ch_k[l+1][\mathbb {ST}] = \lbrace T_k[l]\rbrace \ne \emptyset\). The first callback chk[1] in a chain is typically a timer callback that performs sensor data acquisition. The output of the chain is published on a topic \(T_k[\mu _k] \in ch_k[\mu _k][\mathbb {PT}]\) from which the actuator reads the data.
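The chain condition above, i.e., consecutive callbacks connected via a shared topic, can be checked mechanically. A small sketch of ours with hypothetical callback and topic names:

```python
# A sequence of callbacks forms a chain iff, for every consecutive pair,
# the predecessor publishes on a topic that the successor subscribes to.
def is_chain(callbacks, seq):
    return all(callbacks[a]['PT'] & callbacks[b]['ST']
               for a, b in zip(seq, seq[1:]))

cbs = {
    'cb_sense': {'PT': {'/t1'},  'ST': set()},    # timer callback, e.g., sensor read
    'cb_plan':  {'PT': {'/t2'},  'ST': {'/t1'}},
    'cb_act':   {'PT': {'/cmd'}, 'ST': {'/t2'}},  # publishes the chain's output
}
```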
We are interested in minimizing the jitters Jk in the end-to-end latency Lk of a critical chain chk. For a particular chained execution of callbacks in chk, Lk is the time between the start of chk[1] and the instant when chk[μk] publishes Tk[μk]. We denote the maximum and the minimum observed end-to-end latency of chk by \(\overline{L}_k\) and \(\underline{L}_k\), respectively. We define the jitters Jk as \(\overline{L}_k - \underline{L}_k\), following [12].
For the purpose of scheduling the chain, we are further concerned about the execution and response times of its callbacks. We define the execution time ek[l] as the time required to run by an instance of the callback chk[l], while we denote its maximum observed value by \(\overline{e}_k[l]\). The response time rk[l] of chk[l] is the time between its start and end for a single run and its maximum observed value is denoted by \(\overline{r}_k[l]\).
We understand that a critical chain chk is part of the DAG, because of which it experiences certain unavoidable interference, and our methodology does not exclude such possibilities. (i) The node of a callback chk[l], chk[l][Nd] = Ni, may comprise other callbacks that are not part of chk, i.e., Γi∖chk ≠ ∅. Considering that \(ch_k[l][Nd][\mathcal {T}_{ex}]\) will also run these callbacks besides chk[l], they can delay the execution of chk[l] and increase Lk. (ii) If chk[l] publishes on a topic \(T \in ch_k[l][\mathbb {PT}]\) that is not subscribed to by chk[l + 1], the time spent in computing and publishing the data on T will add to \(e_k[l]\), thereby increasing Lk. (iii) If chk[l] is a synchronizing callback, it might not run even after receiving the data on Tk[l − 1] published by chk[l − 1] and instead wait for data on other synchronized topics in \(ch_k[l][\mathbb {ST}] \setminus \lbrace T_k[l-1]\rbrace\). This again increases Lk.
We emphasize that to manage the timings of a chain chk, we only consider (i) choosing an OS scheduling policy and (ii) configuring the scheduling parameters of the executor threads in \(ch_k[\mathbb {T}_{ex}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}]| 1\le l \le \mu _k\rbrace\) and the DDS threads in \(ch_k[\mathbb {T}_{DDS}] = \cup _{l=1}^{\mu _k} ch_k[l][Nd][\mathbb {T}_{DDS}]\). This is consistent with the timing engineering process generally followed in the industry, during which changes to the application code and the software platform (including middleware libraries) are not easily possible.
2.4 Profiling-based timing model extraction
As highlighted in the survey results in [2], measurement-based timing analysis is the more popular practice in the industry. Hence, as a first work on jitter control of ROS2 computation chains, we focus on profiling their execution in the target environment covering different scenarios. In the process, we obtain (i) \(\overline{e}_k[l]\) and \(\overline{r}_k[l]\) for a callback chk[l] in chk and (ii) \(\overline{L}_k\) and \(\underline{L}_k\) for chk. Although this paper shows how our methods use measurement data, they can also be applied using analysis results obtained, e.g., by applying a combination of WCET, worst-case response time (WCRT), and end-to-end analysis techniques. However, to the best of our knowledge, such a full-fledged analytical framework is not available for industry use. WCET analysis tools either do not scale to real-world workloads, e.g., related to autonomous functionalities, or provide very pessimistic results. Also, table-driven reservation-based scheduling of ROS2 applications and the related WCRT and end-to-end latency analyses have not been studied so far.
A few works in recent years discuss the measurement of the end-to-end latency of ROS2 computation chains. (i) [28] and [45] directly instrument the application code and record timestamps. (ii) Autoware_Perf [30] and CARET [29] extend ros2_tracing [6] to add trace points in ROS2 and measure the end-to-end latency from collected traces. We follow the second approach and use [1]—a ROS2 tracing framework based on extended Berkeley Packet Filter (eBPF). It already offers trace-processing tools to (i) construct DAG representations of ROS2 applications and (ii) measure execution and response times of callbacks. Further, we extend it with our end-to-end latency measurement tool, which works as described below.
For each execution of a chain chk, our tool traverses the events related to each callback chk[l] in chk, as illustrated in Figure 2. For each callback chk[l] except the first one, we find the following events in order: callback_start, data_read, data_write, and callback_end. We do not get any data_read event for chk[1] because we assume that it is a timer callback that acquires the sensor data directly via device drivers. We move from chk[l] to chk[l + 1] in chk by matching the topic name and the source timestamp of the data in the data_write event of chk[l] with the ones in the data_read event of chk[l + 1]. Here, the source timestamp is used to uniquely identify a data item across the publisher and subscriber callbacks. We save the start time of chk[1] and, after traversing the chain, we find the time when chk[μk] publishes the chain’s output on Tk[μk]. We can measure Lk as the difference between the two time instants. We shall note that the measured values of Lk already subsume the interference experienced by the chain, which is key to understanding our proposed mechanism.
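A much-simplified sketch of this traversal for a single chain execution (our illustration; the real tool works on traces collected by the eBPF-based framework, and the event and field names here are hypothetical stand-ins):

```python
def end_to_end_latency(trace, chain):
    """trace: time-ordered event dicts with keys 'event', 'callback', 'time',
    plus 'topic' and 'stamp' (source timestamp) for data_read/data_write.
    chain: ordered callback names ch_k[1..mu_k]."""
    start = next(e['time'] for e in trace
                 if e['event'] == 'callback_start' and e['callback'] == chain[0])
    for cur, nxt in zip(chain, chain[1:]):
        w = next(e for e in trace
                 if e['event'] == 'data_write' and e['callback'] == cur)
        # The source timestamp identifies the same data item at the subscriber;
        # raises StopIteration if no matching data_read exists.
        next(e for e in trace
             if e['event'] == 'data_read' and e['callback'] == nxt
             and e['topic'] == w['topic'] and e['stamp'] == w['stamp'])
    out = next(e for e in trace
               if e['event'] == 'data_write' and e['callback'] == chain[-1])
    return out['time'] - start

trace = [
    {'event': 'callback_start', 'callback': 'cb1', 'time': 0.0},
    {'event': 'data_write', 'callback': 'cb1', 'time': 5.0,
     'topic': '/t1', 'stamp': 100},
    {'event': 'callback_start', 'callback': 'cb2', 'time': 6.0},
    {'event': 'data_read', 'callback': 'cb2', 'time': 6.0,
     'topic': '/t1', 'stamp': 100},
    {'event': 'data_write', 'callback': 'cb2', 'time': 12.0,
     'topic': '/t2', 'stamp': 101},
]
L1 = end_to_end_latency(trace, ['cb1', 'cb2'])   # 12.0
```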
2.5 Priority- vs time-driven chain execution
This section demonstrates the timing behavior of a high-critical ROS2 computation chain ch1 when implemented using two standard real-time scheduling policies, i.e., fixed-priority and time-triggered scheduling. Here, ch1 spans five ROS2 nodes {N1, N2, N3, N4, N5} where each node Ni comprises one callback cbi, 1. Further, ch1 starts with a timer callback ch1[1] = cb1, 1 in N1 that runs every 100 ms and publishes data on a topic T1[1]. Each of the other nodes runs a subscriber callback ch1[l] = cbl, 1 that reads data from topic T1[l − 1], runs computation load, and then publishes data on T1[l]. The chain’s output is, hence, published on T1[5]. For each callback ch1[l], the execution time e1[l] varies uniformly between 5 ms and 15 ms. We run the executor threads \(ch_1[\mathbb {T}_{ex}]\) on one processor core. Additionally, we create a node N6 and run \(N_6[\mathcal {T}_{ex}]\) on the same core. It runs a timer callback cb6, 1 every 100 ms that interferes with ch1. We vary the execution time of cb6, 1 between int and \(\frac{int}{2}\) where int ∈ {10, 20, 30, 40} ms. We run the DDS threads \(ch_1[\mathbb {T}_{DDS}] \cup N_6[\mathbb {T}_{DDS}]\) on another processor core. We use two schedule configurations as follows:
(1)
Using \(\mathbb {S}_1\), we assign (i) a high priority to each executor thread in \(ch_1[\mathbb {T}_{ex}]\) and (ii) a low priority to \(N_6[\mathcal {T}_{ex}]\).
(2)
Using \(\mathbb {S}_2\), we create five high-priority servers R1, R2, R3, R4, and R5 with the time slots (in ms) sl1, 1 = [0, 16), sl2, 1 = [16, 32), sl3, 1 = [32, 48), sl4, 1 = [48, 64), and sl5, 1 = [64, 80), respectively. Further, we create a low-priority server R6 with the time slot (in ms) sl6, 1 = [0, 100). For 1 ≤ i ≤ 6, we assign Ri[CT] = 100 ms and \(R_i[\mathbb {T}]=\lbrace N_i[\mathcal {T}_{ex}]\rbrace\).
In both configurations, (1) and (2), we run the DDS threads using \(\mathbb {S}_1\) while assigning a high priority to \(\mathcal {T} \in ch_1[\mathbb {T}_{DDS}]\) and a low priority to \(\mathcal {T} \in N_6[\mathbb {T}_{DDS}]\).
We measure \(\underline{L}_1\), \(\overline{L}_1\), and J1. Table 1 shows the variation in \(\overline{L}_{1}\) and J1 with int using (1) and (2). We have the following observations:
•
Using (1), we obtain lower values of \(\overline{L}_1\). However, we get higher values of J1, which are mainly caused by the varying execution times—common in real-world applications—of the chain’s callbacks. Hence, using priority-driven scheduling, it is challenging to control jitters in a chain when it has time-varying workloads.
•
Compared to (1), \(\overline{L}_1\) increases using (2). This is due to the over-provisioned time slots {sli, 1|1 ≤ i ≤ 5} that are tuned based on the maximum observed response times \(\lbrace \overline{r}_1[i]|1\le i \le 5\rbrace\) of the callbacks in ch1. That is, until sli, 1 comes, ch1[i] cannot start even when ch1[i − 1] has finished execution. Nevertheless, due to such constrained progress in the chain’s execution, J1 reduces to reflect only the response time variation of the chain’s last callback. Hence, using time-triggered scheduling as in (2), jitters reduce at the expense of an increased maximum end-to-end latency.
•
Using (1), we see a slight increase in \(\overline{L}_{1}\) with increasing int. This is because \(\mathbb {S}_1\) is a practical implementation of fixed-priority preemptive scheduling with non-negligible preemption costs. Such costs are also present in (2) but are not reflected in \(\overline{L}_{1}\) due to the over-provisioned slots (1 ms longer) for running the callbacks in ch1.
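The qualitative trend in these observations can be reproduced with a toy simulation (ours, not the paper's testbed): the slot layout follows configuration (2) and the execution-time range follows the setup above, while everything else is simplified away (no interference, zero communication and preemption costs).

```python
import random

random.seed(0)
SLOT_STARTS = [0, 16, 32, 48, 64]        # starts of sl_{1,1}..sl_{5,1} in ms

def chain_latency(exec_times, time_triggered):
    t = 0.0
    for i, e in enumerate(exec_times):
        if time_triggered:
            t = max(t, SLOT_STARTS[i])   # callback i must wait for its slot
        t += e                           # then runs to completion
    return t

samples = [[random.uniform(5, 15) for _ in range(5)] for _ in range(1000)]
prio = [chain_latency(s, time_triggered=False) for s in samples]
tt   = [chain_latency(s, time_triggered=True)  for s in samples]

J_prio = max(prio) - min(prio)   # jitter accumulates over all five callbacks
J_tt   = max(tt) - min(tt)       # jitter reflects only the last callback
# With slots provisioned for the worst case, the latency under (2) is higher
# but nearly constant: J_tt stays below 10 ms, J_prio is several times larger.
```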
Given the above observations, our goal is to get the best of the priority-driven and time-triggered scheduling schemes. We aim to eliminate jitters in the chain while its maximum end-to-end latency remains comparable to what we get from its high-priority execution.
3 End-to-End Latency Management
This section shows how our proposed mechanism uses table-driven reservation-based scheduling to manage the end-to-end latency of a jitter-sensitive chain while considering ROS2 semantics. Notably, table-driven reservations are effective in implementing certifiable safety-critical systems [46].
3.1 Controlling maximum end-to-end latency
Let us implement a critical chain chk similar to (1) in Section 2.5. We run \(\mathcal {T} \in ch_k[\mathbb {T}_{ex}]\) and \(\mathcal {T} \in ch_k[\mathbb {T}_{DDS}]\) using \(\mathbb {S}_1\) and a high priority. This follows the industry practice of prioritizing critical workloads to minimize the interference experienced by them. We profile such a high-priority execution of chk and measure its maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_1)\) and jitters \(J_k(\mathbb {S}_1)\). We note that \(\overline{L}_k(\mathbb {S}_1)\) subsumes the unavoidable interference experienced by chk, as explained in Section 2.3. We do not consider co-scheduling multiple critical chains.
Following our goal to keep the maximum end-to-end latency close to \(\overline{L}_k(\mathbb {S}_1)\), we define a server Rm. Without loss of generality, we configure Rm with a time slot slm, 1 where:

\(st_{m,1} = 0 \quad \text{and} \quad et_{m,1} = \overline{L}_k(\mathbb {S}_1) + \epsilon, \qquad (1)\)

with ε being a small provisioning slack. We can also shift slm, 1 by an offset while co-optimizing the timing performance of multiple chains; however, this is not the focus of this paper.
The cycle time Rm[CT] of Rm is equal to the period of the chain execution. We consider that \(\overline{L}_k(\mathbb {S}_1) \lt R_m[CT]\), i.e., one execution of chk can be completed within a cycle. Further, we assume that the maximum processor utilization contributed by the threads in \(ch_k[\mathbb {T}_{ex}]\) is less than 100%, i.e., one CPU core can run all threads in \(ch_k[\mathbb {T}_{ex}]\) without any overload. In this paper, we study this simple case. However, when \(\overline{L}_k(\mathbb {S}_1) \gt R_m[CT]\), our method can be extended to cover two or more processor cores and define servers on each core for a pipelined execution of the chain. The placement of time slots in that case needs to consider profiles of sub-chains, which we leave as future work.
We assign the threads running the callbacks in chk to Rm, i.e., \(R_m[\mathbb {T}] = ch_k[\mathbb {T}_{ex}]\). At the beginning of slm, 1 in a cycle, chk[1] runs. Thereafter, chk[l] can start whenever it has received the data from chk[l − 1]—similar to using \(\mathbb {S}_1\) and a high priority. Hence, chk will finish execution as soon as possible, and the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) using this technique will be approximately equal to \(\overline{L}_k(\mathbb {S}_1)\). We assign a high priority to Rm, e.g., Rm[Pr] = 1 using LITMUSRT. To improve the utilization of the processor core when chk does not run until \(\overline{L}_k(\mathbb {S}_2)\), we can define another server \(R_{m^{\prime }}\) with \(R_{m^{\prime }}[Pr] \gt R_m[Pr]\) and map threads performing best-effort tasks to it. In that case, Rm basically isolates chk from the interference by the best-effort tasks, and we still get \(\overline{L}_k(\mathbb {S}_2) \approx \overline{L}_k(\mathbb {S}_1)\).
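The dimensioning rule above can be sketched as follows (our illustration; the slack value eps and all numbers are hypothetical, and the actual server creation is done via LITMUSRT tools):

```python
# R_m gets one slot per cycle, starting at 0 and sized by the maximum
# end-to-end latency measured under high-priority execution plus a small
# hypothetical slack; the cycle time equals the chain's period.
def dimension_server(L_max_S1, period, eps=1.0, prio=1):
    assert L_max_S1 + eps < period, "one chain execution must fit in a cycle"
    return {'slots': [(0.0, L_max_S1 + eps)], 'CT': period, 'Pr': prio}

R_m = dimension_server(L_max_S1=62.5, period=100.0)   # slot [0, 63.5)
# A lower-priority server with an overlapping slot can host best-effort
# threads without adding interference to the chain:
R_prime = {'slots': [(0.0, 100.0)], 'CT': 100.0, 'Pr': 2}
```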
Experimentally, we have also studied the DDS communication threads in a ROS2 node. We have observed that a few of them—depending on the DDS implementation we use—influence the end-to-end latency by delaying the communication between the chain’s callbacks. Hence, we must isolate these threads from interference by best-effort workloads. Further, we have observed that while they run during data send and receive, they also run at other instants performing protocol-related tasks (e.g., sending heartbeat signals and polling message queues). Hence, we cannot assign them to servers with specific time slots without delaying an important DDS-related task significantly (e.g., by tens of milliseconds). Also, they typically run for a short duration (several hundred microseconds) when they wake up. They mostly run in parallel to the ROS2 threads executing the callbacks. Considering the above observations related to the DDS threads, we bind the threads in \(ch_k[\mathbb {T}_{DDS}]\) to a different processor core and use \(\mathbb {S}_1\) to schedule them with a high priority. This ensures that the DDS communication in chk will have a minimal latency.
3.2 Controlling jitters
When we schedule chk using a high-priority server Rm as explained above, we can minimize the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\). Now, to achieve the goal of reducing the jitters, we study two mechanisms for publishing the chain’s output, namely asynchronous and synchronous publish, as described in Section 2.2.
3.2.1 Asynchronous publish.
When the last callback chk[μk] in chk writes its output using this mechanism, the publisher thread \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) publishes the data to its subscribers. We propose to execute \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) in a server \(R_{m^\circ }\), as illustrated in Figure 3 Design-1. We configure \(R_{m^\circ }\) with a time slot \(sl_{m^\circ ,1}\) given by:

\(st_{m^\circ ,1} = et_{m,1} \quad \text{and} \quad et_{m^\circ ,1} = et_{m,1} + \overline{e}_k[pub].\)

Here, etm, 1 marks the end of slm, 1 in Rm (see Equation 1) and \(\overline{e}_k[pub]\) is the maximum time required to publish chk’s output. \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) can be woken up any time within slm, 1, but it cannot run until it gets the time slot \(sl_{m^\circ ,1}\). Hence, chk’s output is available to its subscribers only during \(sl_{m^\circ ,1}\). That is, the end-to-end latency \(L_k(\mathbb {S}_2)\) of chk varies between \(\overline{L}_k(\mathbb {S}_1) + \epsilon\) and \(\overline{L}_k(\mathbb {S}_1) + \epsilon + \overline{e}_k[pub]\). We have observed that \(\overline{e}_k[pub] \lt 1\:\text{ms}\), which results in \(J_k(\mathbb {S}_2) \lt 1\:\)ms. We can further configure \(R_{m^\circ }\) such that:
We term the above idea latency shaping. Typically, a task is either event- or time-triggered. However, latency shaping requires a time- and event-triggered task to publish the chain’s output. That is, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) is triggered first when chk[μk] invokes an appropriate DDS API to publish data, and later, at the time instant when \(sl_{m^\circ ,1}\) starts, it is dispatched. Here, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) cannot run until both conditions are fulfilled.
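A toy calculation (ours) of the effect described in this subsection: the chain finishes at a varying instant within slm, 1, but the output becomes visible only in \(sl_{m^\circ ,1}\), so the end-to-end latency is pinned. All numbers are hypothetical.

```python
ET_M1 = 63.5   # end of sl_{m,1} = start of sl_{m°,1}, in ms (hypothetical)
E_PUB = 0.5    # maximum observed publish time e_k[pub], in ms (hypothetical)

def output_time(chain_finish):
    """The publisher thread may be woken any time within sl_{m,1}, but it is
    dispatched only when sl_{m°,1} starts; the publish time is modeled at its
    maximum, so the output instant is identical for every execution."""
    assert chain_finish <= ET_M1, "chain must finish within sl_{m,1}"
    return ET_M1 + E_PUB

# Three executions finishing at very different times all yield the same
# end-to-end latency of 64.0 ms -> the jitter collapses.
latencies = [output_time(f) for f in (40.2, 55.7, 62.9)]
```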
3.2.2 Synchronous publish.
Using this mechanism, chk[μk] directly publishes chk’s output, i.e., \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) runs chk[μk] as well as publishes data on Tk[μk]. We study only static allocation of threads to servers. Now, if we keep \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) in Rm, we end up getting the same jitters \(J_k(\mathbb {S}_1)\) as with the high-priority execution of chk. Alternatively, we can use two servers, Rm and \(R_{m^\circ }\), as in Section 3.2.1, where the threads in \(ch_k[\mathbb {T}_{ex}] \setminus \lbrace ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\rbrace\) run using Rm and \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) uses \(R_{m^\circ }\). Here, we can configure slm, 1 based on the measured maximum end-to-end latency \(\overline{L}_{k-}(\mathbb {S}_1)\)—during a high-priority execution—of the sub-chain chk − comprising {chk[l] | 1 ≤ l ≤ μk − 1}, while \(sl_{m^\circ ,1}\) shall be longer than the maximum response time \(\overline{r}_k[\mu _k]\) of chk[μk]. Further, slm, 1 is immediately followed by \(sl_{m^\circ ,1}\) in time. However, in this design, the execution time variation in chk[μk] still contributes to jitters in chk, which may not be acceptable.
Hence, we propose to add a latency-shaping subscriber callback \(ch_k[\mu _k+1]\) inside a ROS2 node \(ch_k[\mu _k+1][Nd]\), as shown in Figure 3 Design-2. We change the name of the topic \(T_k[\mu _k]\) on which \(ch_k[\mu _k]\) publishes. Here, we do not modify or recompile the application code, which is a crucial consideration as the application sources are often not available to a systems engineer in the industry. In fact, we can just remap the topic name of \(T_k[\mu _k]\) in the launch file from, e.g., \(T_k[\mu _k][Nm]\) to \(T_k[\mu _k][Nm]\_tmp\). We implement \(ch_k[\mu _k+1]\) to subscribe to and read data from \(T_k[\mu _k]\) (i.e., the subscribed topic name must be \(T_k[\mu _k][Nm]\_tmp\)) and publish the same data to \(T_k[\mu _k+1]\), i.e., it republishes \(ch_k\)'s output. We further ensure that \(T_k[\mu _k+1][Nm] = T_k[\mu _k][Nm]\), i.e., it has the original name of \(T_k[\mu _k]\). That is, the subscribers of \(ch_k\)'s output will now read data from \(T_k[\mu _k+1]\). This essentially extends the chain \(ch_k\) to \((ch_k[1], \ldots , ch_k[\mu _k], ch_k[\mu _k+1])\), considering that the end of its execution is identified when its output is available.
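As a sketch, such a remapping could look as follows in a ROS2 Python launch file; the package, executable, and topic names (app_pkg, last_callback_node, chain_output, etc.) are placeholders of ours, not identifiers from this work.

```python
# Hypothetical launch file: redirect the producer's output topic so that
# the latency-shaping node can take over the original topic name.
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Unmodified application node: its publications on 'chain_output'
        # are transparently redirected to 'chain_output_tmp'.
        Node(
            package='app_pkg', executable='last_callback_node',
            remappings=[('chain_output', 'chain_output_tmp')],
        ),
        # Latency-shaping node: subscribes to 'chain_output_tmp' and
        # republishes each sample on the original name 'chain_output'.
        Node(
            package='shaping_pkg', executable='latency_shaper',
        ),
    ])
```

Because the remapping happens in the launch configuration, the application binaries stay untouched, matching the no-recompilation requirement above.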
Now, we run this extended chain using two servers, \(R_m\) and \(R_{m^\circ }\), that execute \(R_m[\mathbb {T}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}] \mid 1 \le l \le \mu _k\rbrace\) and \(R_{m^\circ }[\mathbb {T}] = \lbrace ch_k[\mu _k+1][Nd][\mathcal {T}_{ex}]\rbrace\), respectively. While \(R_m\) shall comprise a slot \(sl_{m,1}\) as per Equation 1, \(R_{m^\circ }\) shall comprise a slot given by:
where \(\overline{r}_k[\mu _k+1]\) is the maximum observed response time of \(ch_k[\mu _k+1]\) to execute the logic for republishing \(ch_k[\mu _k]\)'s output. Here, \(ch_k[\mu _k+1]\) only republishes a data item and does not perform any computation; thus, \(\overline{r}_k[\mu _k+1]\) is short and impacts the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) negligibly. Also, the time to republish should be fairly constant and, hence, the jitters \(J_k(\mathbb {S}_2)\) are negligible.
Additionally, we note that \(ch_k[\mu _k+1][Nd][\mathbb {T}_{DDS}]\) shall be scheduled using \(\mathbb {S}_1\) and a high priority to avoid communication delays between \(ch_k[\mu _k]\) and \(ch_k[\mu _k+1]\).
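To make the shaping effect concrete, the following toy model (all numbers and names are ours for illustration, not measured data) shows how dispatching the republisher only at the start of \(sl_{m^\circ ,1}\) collapses widely varying finish times into one short output band:

```python
def shaped_output_time(finish_time, slot_start, repub_time):
    """Time at which the chain's output becomes visible when the
    republisher thread runs only in a slot starting at slot_start.
    Illustrative model, not the paper's formal analysis."""
    # The republisher is woken by the arriving sample but dispatched no
    # earlier than the slot start; republishing itself takes repub_time.
    return max(finish_time, slot_start) + repub_time

# Sub-chain finish times vary widely across executions (within sl_m1)...
finishes = [3.1, 5.7, 8.4]   # ms
slot_start = 9.0             # ms, start of the slot in R_m°
repub = 0.05                 # ms, assumed maximum republish time
outputs = [shaped_output_time(f, slot_start, repub) for f in finishes]
# ...yet the output instants all land at slot_start + repub.
jitter = max(outputs) - min(outputs)
```

With the slot placed after every observed finish time, the residual jitter is bounded by the (small, nearly constant) republish time rather than by the chain's execution-time variation.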
3.3 Multiple latency bands
So far, we have obtained one short band in which the end-to-end latency lies for each chain execution. We can produce more such bands by configuring \(R_{m^\circ }\) with multiple non-overlapping time slots, as shown in Figure 3 Design-3. The length \(\Delta _{m^\circ }\) of each time slot is calculated from Equation 2 or 4 based on how the chain's output is published. For example, \(\Delta _{m^\circ } = \overline{r}_k[\mu _k+1]\) if \(ch_k[\mu _k]\) publishes synchronously. We place the time slots \((sl_{m^\circ ,1}, sl_{m^\circ ,2}, \ldots , sl_{m^\circ ,\nu _{m^\circ }})\) in chronological order, where \(sl_{m^\circ ,\alpha } = [x_{m^\circ ,\alpha } - \Delta _{m^\circ }, x_{m^\circ ,\alpha })\). We consider \(x_{m^\circ ,1} - \Delta _{m^\circ } \gt \underline{L}_k(\mathbb {S}_1)\), where \(\underline{L}_k(\mathbb {S}_1)\) is the observed minimum end-to-end latency of \(ch_k\) during its high-priority execution. Otherwise, \(ch_k\) may never publish its output in \(sl_{m^\circ ,1}\). We place \(sl_{m^\circ ,\nu _{m^\circ }}\) in the same position as in our single-band solution (i.e., following Equation 2 or 4). Hence, we do not change the maximum end-to-end latency of \(ch_k\).
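The slot placement rules above can be checked mechanically. The helper below builds the slots \([x_{m^\circ ,\alpha } - \Delta _{m^\circ }, x_{m^\circ ,\alpha })\) and enforces the two conditions (first slot opens after the minimum latency; slots do not overlap); the concrete millisecond values are invented for illustration.

```python
def place_slots(x_positions, delta, min_latency):
    """Build the multi-band slots sl_{m°,a} = [x_a - delta, x_a) and
    validate the placement conditions of the multi-band design.
    Sketch under assumed units (ms); not the paper's implementation."""
    slots = [(x - delta, x) for x in sorted(x_positions)]
    # The first slot must open after the minimum observed latency,
    # otherwise the chain might never publish its output in sl_{m°,1}.
    if slots[0][0] <= min_latency:
        raise ValueError("first slot opens before the chain can finish")
    # Slots must be non-overlapping.
    for (_, prev_end), (next_start, _) in zip(slots, slots[1:]):
        if next_start < prev_end:
            raise ValueError("slots overlap")
    return slots

# Three latency bands within one cycle; delta per Equation 2 or 4 (0.5 ms
# assumed), minimum observed latency 4 ms assumed.
slots = place_slots([12.0, 20.0, 28.0], delta=0.5, min_latency=4.0)
```

The last slot is kept at the single-band position, so adding earlier bands never worsens the maximum end-to-end latency.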
Such a configuration of \(R_{m^\circ }\) supports a multi-mode implementation of a chain \(ch_k\) where, in each mode, it uses a specific slot in the cycle to publish its output. That is, we get negligible jitters in the end-to-end latency in each operating mode of \(ch_k\). An autonomous system, e.g., a car, operates in different environments or modes, producing largely varying timing behavior of critical computation chains. For example, in heavy traffic, the end-to-end latency from LIDAR/camera to steering and speed control might be long because of a long computation time in object detection and tracking. System-level performance can improve if specific latency-aware control logic is designed per mode [37, 38], and our design supports that.
We foresee that mode-switch conditions are statically defined. We consider that the active mode can be read from a topic by \(ch_k[\mu _k+1][Nd]\). It can use this knowledge to ensure that it publishes in an appropriate slot to maintain the end-to-end latency of the mode. Here, we need to put additional logic in \(ch_k[\mu _k+1][Nd]\); a prospective implementation is briefly discussed in Section 6. The important point is that, in our design, we do not need to change the server configurations online during a mode switch. We note that co-designing a multi-mode application logic and its timing design following multi-band latency shaping is future work.
6 Future Considerations
In our problem setting, as explained in Section 2.3, a critical chain may experience interference from a non-critical callback run by an executor thread that also runs a callback in the chain. For example, in Section 5.1, the Throttler node \(N_{Th}\) has two callbacks, \(cb_r^{Th}\) and \(cb_f^{Th}\), while only \(cb_r^{Th}\) is in the critical chain \(ch_{rl}\). Both are run by \(N_{Th}[\mathcal {T}_{ex}]\) in \(R_{Th}\), and \(R_{Th}[Sl]\) is designed to sufficiently accommodate both callbacks every cycle. For more precise control over each callback's execution in a node, we suggest developing the node using multiple mutually exclusive callback groups and assigning each group to a different executor using the add_callback_group(…) function [36].
Further, this paper does not study the case when the chain needs to run longer than the slot duration in \(R_m\), which will delay the next chain execution. We can adapt our design to eliminate such delays when the chain's worst-case end-to-end latency is shorter than its period. That is, we can extend \(sl_{m,1}\) to cover the entire period so that the chain will finish its computation within a period. However, the chain's output is not published in the same period if the corresponding \(sl_{m^\circ ,1}\) is missed. We can add an appropriate liveliness QoS for \(T_k[\mu _k]\) so that it is not processed in the next period. In this case, we discard a chain execution if it overruns the estimated maximum end-to-end latency. Several works [14, 24, 47, 48] have analyzed system safety—in terms of control stability and performance—when control updates may be skipped. In Section 3.1, we also commented on the case where the worst-case end-to-end latency is longer than the period.
While we have demonstrated how latency shaping can be combined with DAG scheduling, a better formalization of the application model and solution is future work. Also, when an executor thread runs multiple callbacks in a DAG and shall be statically bound to a core, partitioned DAG scheduling must consider that these callbacks run on one processing unit, which is similar to [42].
While we can support multi-mode application design, we have not formalized the logic inside a latency-shaping node \(ch_k[\mu _k+1][Nd]\) to ensure a constant end-to-end latency in a mode. First, \(ch_k[\mu _k+1]\) must be aware of the active mode; hence, another callback in \(ch_k[\mu _k+1][Nd]\) subscribes to the corresponding topic and passes the information via a shared variable. Further, \(ch_k[\mu _k+1]\) must track in which slot it is running, which is possible using a timer callback in \(ch_k[\mu _k+1][Nd]\) that updates a counter in each slot in \(R_{m^\circ }\) while sharing the value with \(ch_k[\mu _k+1]\). By being aware of the active mode and slot, \(ch_k[\mu _k+1]\) can decide appropriately to publish or discard the output, or to wait for the next slot. Here, we do not dynamically update \(R_{m^\circ }\).
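A minimal sketch of this prospective logic follows; the class, method names, and mode-to-slot mapping are our assumptions (framework calls are abstracted behind plain methods), not a formalized design.

```python
class ModeAwareShaper:
    """Sketch of the extra logic in the latency-shaping node: a subscriber
    callback tracks the active mode, a timer callback counts the slots of
    R_m° within a cycle, and the republish callback forwards the chain's
    output only in the slot assigned to the active mode."""

    def __init__(self, num_slots, slot_of_mode, publish_fn):
        self.num_slots = num_slots        # slots per cycle in R_m°
        self.slot_of_mode = slot_of_mode  # mode -> slot index to publish in
        self._publish_fn = publish_fn
        self.mode = None
        self.slot = -1                    # no slot active yet
        self._pending = None

    def on_mode_msg(self, mode):
        # Subscriber callback on the mode topic (shared-variable handover).
        self.mode = mode

    def on_slot_timer(self):
        # Timer callback fired at the start of each slot of R_m°.
        self.slot = (self.slot + 1) % self.num_slots
        self._maybe_publish()

    def on_chain_output(self, msg):
        # Republish callback: hold the sample until the right slot.
        self._pending = msg
        self._maybe_publish()

    def _maybe_publish(self):
        # Publish only in the active mode's slot; otherwise keep waiting.
        # (A variant could instead discard stale output here.)
        if self._pending is not None and self.slot == self.slot_of_mode.get(self.mode):
            self._publish_fn(self._pending)
            self._pending = None
```

Since the slot schedule of \(R_{m^\circ }\) is fixed, only this decision logic changes with the mode, which is why no online server reconfiguration is needed.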
8 Concluding Remarks
This paper introduces the concept of latency shaping, which primarily enables a low-jitter implementation of a ROS2 computation chain. Compared to LET, it reduces pessimism as well as supports a multi-mode implementation of a chain. It is more practical because it uses the profiling results of the chain instead of relying on analytical frameworks. We further show how the concept can be implemented considering different mechanisms to publish data in ROS2. An important aspect of our proposed idea is that it does not require modifying or recompiling the application code, which is a crucial requirement to preserve the separation of concerns between application development and timing engineering in the industry.
While we have considered static allocation of threads to reservation servers in this work, in the future, we intend to explore dynamic allocation for the last callback in the chain so that we can split the computation and data-publish tasks of the thread into two servers, thereby eliminating the use of an additional latency-shaping callback. It would also be interesting to explore the co-design of control logic and latency shaping. While we have considered computation chains over ROS2, DDS, and Linux, we believe that the idea is generic and can be applied to other middlewares and operating systems as well.