1 Introduction
The advancement in artificial intelligence (AI) and computer vision (CV) is driving cyber-physical systems (CPS) towards autonomy, e.g., self-driving cars, drones, and robots. Robot Operating System 2 (ROS2) is becoming pivotal in this drive due to (i) its vast repertoire of open-source implementations of AI and CV algorithms, (ii) its portability to different operating systems (OS), and (iii) its support for collaborative development of complex software applications.
Software applications over ROS2 consist of modular functional components called
nodes. A node comprises
callbacks to handle timer and communication events. The nodes send and receive data using
topics following Data Distribution Service (DDS) standard. An application can contain
computation chains formed by a sequential invocation of callbacks connected via topics. In autonomous systems, a chain reads and processes sensor data, performs planning, executes control logic, and applies actuation signals [20].
A long and unpredictable end-to-end latency of such a chain can lead to lower performance or even unsafe physical behavior [31, 39]. Static and dynamic priority-based scheduling have been studied in ROS2 to reduce the end-to-end latency, e.g., [3, 9, 15, 50]. However, none of the existing works considers minimizing the jitters, i.e., variations in the end-to-end latency, of ROS2 chains.
Typically, jitters are eliminated using techniques [7, 18, 21, 22, 33, 34] following the Logical Execution Time (LET) concept [25]—the de-facto industry standard, e.g., see AUTOSAR [4]. Figure 1a shows a conventional LET implementation where, at the beginning of a period, a task first writes its output from the previous execution, then reads the new input and starts processing it. Hence, a task spends a constant time—equal to its period—between data read and write. Also, a chain has a constant end-to-end latency. Nevertheless, existing LET implementations are not compatible with ROS2 chains for the following reasons: (i) They require controlling the timing of reading from and writing to topics, which is non-trivial for the more popular DDS communication mechanism. (ii) They require the callbacks to be time-triggered, which is not the case when one subscribes to a topic.
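To make the LET timing concrete, consider the following toy sketch (our illustration, not an implementation from the literature): under LET, a chain of n tasks with a common period has a constant end-to-end latency of n periods, independent of the jobs' actual execution times. All function and parameter names are hypothetical.

```python
# Toy model of conventional LET semantics: each task reads its input and
# (logically) publishes its output exactly one period apart, so latency is
# decoupled from execution time as long as every job meets its period.
def let_chain_latency(num_tasks, period, exec_times):
    for e in exec_times:
        # LET assumes each job finishes within its period.
        assert e <= period, "job overruns its period"
    # One period per task between data read and data write.
    return num_tasks * period

# A 3-task chain with a 100 ms period: the latency is 300 ms for every
# sample, whether the jobs run for 5 ms or 95 ms -> zero jitter.
print(let_chain_latency(3, 100, [5, 50, 95]))   # 300
```

This constant one-period-per-task delay is exactly the pessimistic assumption that our approach relaxes.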
Proposed solution: Towards a practical LET implementation of a ROS2 chain, we use table-driven reservation servers. Such a server reserves specific time slots on a processing unit to run its assigned threads. We execute a high-critical (sub-)chain using high-priority servers and run best-effort workloads in low-priority servers, i.e., we consider hierarchical scheduling. This offers high real-time performance as well as good processor utilization. Unlike conventional time-triggered scheduling, where computation time is reserved for each task based on its worst-case execution time (WCET), we assign chained callbacks to a server Rm capable of running multiple threads. We dimension Rm based on the maximum measured end-to-end latency \(\overline{L}\) of the assigned chain during a priority-driven execution. Such measurement-based approaches are popular in the industry [2].
Further, we study two DDS communication mechanisms. (1) When asynchronous publish is used, one OS thread executes the computation logic while another publishes the data. (2) Both tasks are performed by the same thread using synchronous publish.
When the chain’s output is published asynchronously, we assign the corresponding publisher thread to a server \(R_{m^\circ }\) with a time slot immediately following Rm in time. Such a two-server scheduling of the chain enables latency shaping, i.e., it minimizes jitters while preserving \(\overline{L}\). Figure 1b depicts this behavior, where we get a constant end-to-end latency \(\approx \overline{L}\). It also shows that we relax the pessimistic LET assumption of one period per task’s response.
With the more popular synchronous publish mechanism, we face a technical challenge when the chain’s last callback has a large execution time variation while it is run by the thread that also publishes the chain’s output. We solve this by architecturally extending the chain with a callback that republishes the chain’s output and by running this callback in \(R_{m^\circ }\). We perform this chain extension by exploiting ROS2-supported modular software development, i.e., without any modification or recompilation of the application source code. In fact, we provide a complete tool chain automating each step from profiling to running the chain using reservation servers.
\(R_{m^\circ }\) can also be configured with additional time slots, offering a multi-latency implementation of the chain. Figure 1c illustrates that the chain’s end-to-end latency falls in two distinct time bands when we appropriately configure \(R_{m^\circ }\) with two time slots per cycle. Such an implementation encourages multi-mode application design where, in each mode, the chain has a specific latency. This is particularly useful for autonomous systems running in different environments. Further, we apply particle swarm optimization [26] to place the time slots and try to minimize the average end-to-end latency—a performance measure commonly used in the industry.
Contributions: Our main contributions are as follows:
•
We introduce the concept of latency shaping to minimize jitters in a ROS2 computation chain with a negligible impact on the observed maximum end-to-end latency. We also demonstrate its versatility in supporting multi-mode chain operation without the need for dynamic schedule reconfiguration, which is crucial for next-generation autonomous systems.
•
We propose two chain-aware implementations of latency shaping that are compatible with ROS2 semantics.
•
We develop tools that will automatically (i) determine configurations for reservation servers and (ii) create and configure them as well as assign chain components to run inside them. In essence, we enable design automation for latency shaping.
•
We apply latency shaping on a real-world benchmark, i.e., for a chain from Lidar to vehicle pose estimation in Autoware’s Autonomous Valet Parking (AVP) [20].
•
We perform experiments to demonstrate that our proposed concepts can be applied (i) to implement a ROS2 chain following the conventional LET concept and (ii) to directed acyclic graphs (DAGs) comprising ROS2 callbacks.
Paper organization: Section 2 provides the system model and briefly discusses our profiling-based timing model extraction tool. Using a synthetic ROS2 chain, it further compares priority-driven and time-triggered scheduling in terms of maximum end-to-end latency and jitters. In Section 3, we describe our proposed mechanism to manage the end-to-end latency of a ROS2 chain and its different implementations. Section 4 outlines (i) our problem formulation to place time slots for multi-band latency shaping and (ii) a tool chain to automate latency shaping. We present different case studies in Section 5. We discuss certain complex scenarios for future consideration in Section 6. We discuss the related work in Section 7 and provide concluding remarks in Section 8.
2 Background and System Model
2.1 Modeling ROS2 applications as DAGs
ROS2 is a multi-layer middleware [
35] that provides easy-to-use application program interfaces (APIs) to develop complex software applications. A major advantage is that it enables independent development of software modules that can be later easily composed together to run coherently even on a distributed platform.
In ROS2, a standalone software module that implements a particular functionality, e.g., object detection, is called a node. The main building blocks of a ROS2 node are event-handling callbacks. There are four types of callbacks, namely timer, subscriber, service, and client callbacks. Timer callbacks are triggered by periodic timer events. Communication between nodes is carried out via topics. When a node subscribes to a topic, new data on the topic triggers the designated subscriber callback to handle it. ROS2 offers a feature called service for blocking remote procedure calls (RPCs). Communications related to services are carried out using request and reply topics. A caller writes input arguments on a request topic to invoke the corresponding service callback. After processing the input, the service callback writes its output on the reply topic, which triggers the client callback.
Consider that a set of η nodes {Ni | 1 ≤ i ≤ η} implements the software applications over ROS2. Each node Ni is further composed of a set Γi of γi callbacks, Γi = {cbi, j | 1 ≤ j ≤ γi}. We can model the applications using a DAG \(\mathbb {G}\) where each callback cbi, j is a vertex in \(\mathbb {G}\). That is, the set of vertices \(\mathbb {V}\) in \(\mathbb {G}\) is given by Γ1 ∪ Γ2 ∪ … ∪ Γη. We draw an edge from cbi, j to \(cb_{i^{\prime },j^{\prime }}\) when cbi, j publishes on a topic T, i.e., \(T \in cb_{i,j}[\mathbb {PT}]\), while \(cb_{i^{\prime },j^{\prime }}\) subscribes to T, i.e., \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\). Further, cbi, j can have multiple outgoing edges when (i) \(T \in cb_{i,j}[\mathbb {PT}]\) has more than one subscriber and/or (ii) cbi, j publishes different parts of its result on multiple topics, i.e., \(|cb_{i,j}[\mathbb {PT}]| \gt 1\). In the same vein, \(cb_{i^{\prime },j^{\prime }}\) can have more than one incoming edge when (i) multiple callbacks publish on \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\) and/or (ii) \(cb_{i^{\prime },j^{\prime }}\) subscribes to two or more topics, i.e., \(|cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]| \gt 1\). For the latter case, we have seen an implementation of synchronizing callbacks using message filters in Autoware [49]. Such a callback \(cb_{i^{\prime },j^{\prime }}\) executes its logic only when each of the subscribed topics has new data. Further, we note that timer callbacks have no incoming edges because they do not subscribe to any topic. Other callbacks (i.e., subscriber, service, and client callbacks) can have both incoming and outgoing edges.
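As an illustration of the edge rule above (our sketch, not part of any tool described in this paper), the DAG can be derived directly from the callbacks’ published and subscribed topic sets. Callback and topic names are hypothetical.

```python
# Build the application DAG: draw an edge cb -> cb' whenever cb publishes on
# a topic that cb' subscribes to (shared element of PT(cb) and ST(cb')).
def build_dag(callbacks):
    """callbacks: dict name -> {'PT': published topics, 'ST': subscribed topics}."""
    return {(u, v)
            for u, cu in callbacks.items()
            for v, cv in callbacks.items()
            if cu['PT'] & cv['ST']}

cbs = {
    'cb_timer':  {'PT': {'/scan'},   'ST': set()},        # timer: no incoming edge
    'cb_filter': {'PT': {'/scan_f'}, 'ST': {'/scan'}},
    'cb_pose':   {'PT': {'/pose'},   'ST': {'/scan_f'}},
    'cb_log':    {'PT': set(),       'ST': {'/scan_f'}},  # second subscriber of /scan_f
}
edges = build_dag(cbs)
# '/scan_f' has two subscribers, so 'cb_filter' gets two outgoing edges.
```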
2.2 Scheduler and thread models
We study the single-threaded executor in ROS2, which is the one mostly studied in the literature [13, 15, 44] and is also used in our benchmark application from Autoware. Such an executor runs all callbacks in a node Ni based on a certain scheduling policy using only one thread \(N_i[\mathcal {T}_{ex}]\). While it is possible to customize the scheduling inside executors [3, 15], our experiments only consider the default policy, which is described in [13, 36]. This paper does not evaluate how different executor implementations influence application timings. However, we believe that our techniques can be trivially applied to all variations of single-threaded executors.
Further, the threads are scheduled by the OS. Here, we have studied two scheduling policies. (i) \(\mathbb {S}_1\): fixed-priority preemptive scheduling provided by SCHED_FIFO in (Preempt_RT-patched) Linux. When using \(\mathbb {S}_1\), we assign a priority \(\mathcal {T}[Pr]\) to a thread \(\mathcal {T}\); e.g., \(N_i[\mathcal {T}_{ex}][Pr]\) gives the priority of the executor thread \(N_i[\mathcal {T}_{ex}]\) of node Ni. Here, the larger the value of \(\mathcal {T}[Pr]\), the higher the priority of \(\mathcal {T}\) to run. (ii) \(\mathbb {S}_2\): table-driven reservation-based scheduling provided by LITMUSRT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems, a Linux kernel extension) [10]. When using \(\mathbb {S}_2\), we allocate a thread \(\mathcal {T}\) to a reservation server \(\mathcal {T}[RS]\); e.g., \(N_i[\mathcal {T}_{ex}][RS]\) denotes the server that runs \(N_i[\mathcal {T}_{ex}]\). Details about reservation servers are provided later in this section.
Overall, we see a hierarchical scheduling of ROS2 callbacks. That is, the OS scheduler selects an executor thread of a particular node to run, while the executor then runs one of the node’s ready callbacks. Besides the executor thread, a set of threads, \(N_i[\mathbb {T}_{DDS}]\), is also spawned by the DDS layer for a node Ni to manage both intra- and inter-node communication [16, 19, 41]. In this paper, we study two mechanisms to publish data on a topic [17]. (i) Asynchronous publish: When a callback cbi, j writes its output on the topic \(T \in cb_{i,j}[\mathbb {PT}]\) using this mechanism, a publisher thread \(N_i[\mathcal {T}_{pub}] \in N_i[\mathbb {T}_{DDS}]\) is woken up, which then executes the task of publishing the data to its subscribers. (ii) Synchronous publish: Using this mechanism, a callback cbi, j directly publishes its output on \(T \in cb_{i,j}[\mathbb {PT}]\), i.e., the executor thread \(N_i[\mathcal {T}_{ex}]\) runs the logic in cbi, j as well as the task of publishing its output to the subscribers.
2.2.1 Table-driven reservation-based scheduling.
In this scheduling environment, threads run inside reservation servers. A server Rm is defined using a set of time slots, \(R_m[Sl] = \lbrace sl_{m,1}, sl_{m,2}, \ldots , sl_{m,\nu _m}\rbrace\), and a cycle time, Rm[CT], that are statically configured. Each slot slm, n has a start time stm, n and an end time etm, n and it repeats every Rm[CT] time units. A thread \(\mathcal {T}\) assigned to \(\mathcal {T}[RS] = R_m\) can only run inside Rm’s time slots. Besides enabling timing determinism, a server constrains the amount of time for which its threads can run on the processor, thereby providing temporal isolation between different sets of threads.
We use LITMUSRT to perform our experiments. It provides real-time schedulers and synchronization mechanisms. Its plugin P-RES implements partitioned reservation-based scheduling and allows defining and using table-driven reservation servers. It allows one to (i) assign a server Rm to a processor core Rm[PC] and (ii) map a set of threads \(R_m[\mathbb {T}]\) to it. If multiple threads are ready to run in a server, a round-robin scheduling policy allocates processor time to them. Further, a priority Rm[Pr] can be assigned to a server Rm; in the LITMUSRT implementation, a smaller value of Rm[Pr] implies a higher priority. Priorities are useful for isolating high-critical applications from best-effort applications. That is, simultaneous to a server Rm running time-sensitive threads, we can define another server \(R_{m^{\prime }}\) with overlapping time slots and \(R_{m^{\prime }}[Pr] \gt R_{m}[Pr]\) to run best-effort threads. In this case, \(R_{m^{\prime }}\) can run its threads only when there are no ready-to-run or running threads in Rm.
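The slot and priority semantics described above can be sketched as follows (a simplified model of our own, not LITMUSRT code); slot boundaries, cycle times, and server names are hypothetical.

```python
# A server may run only when the current time, taken modulo its cycle time,
# falls inside one of its statically configured slots.
def server_active(server, t):
    phase = t % server['CT']
    return any(st <= phase < et for st, et in server['slots'])

# Among active servers with ready threads, the numerically smallest priority
# value wins (LITMUS^RT convention: smaller value = higher priority).
def runnable(servers, t):
    active = [s for s in servers if server_active(s, t) and s['ready']]
    return min(active, key=lambda s: s['Pr'])['name'] if active else None

R_hi = {'name': 'R_hi', 'slots': [(0, 80)],  'CT': 100, 'Pr': 1, 'ready': True}
R_be = {'name': 'R_be', 'slots': [(0, 100)], 'CT': 100, 'Pr': 2, 'ready': True}
# R_be fully overlaps R_hi but only runs when R_hi has no ready threads.
```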
This paper promotes table-driven reservation-based scheduling in LITMUSRT because it offers both timing determinism and good processor utilization for mixed-criticality workloads.
2.3 Modeling ROS2 computation chains
While it is clear that ROS2 applications can form complex DAGs, we note that this work does not promise to efficiently schedule a whole DAG. Instead, it focuses on scheduling a jitter-sensitive critical chain of ROS2 callbacks. We define a computation chain chk as an ordered sequence (chk[1], ⋅⋅⋅, chk[l], chk[l + 1], ⋅⋅⋅, chk[μk]) of μk callbacks—\(ch_k[l] \in \mathbb {V}\) is a callback—connected via topics. That is, there is a directed edge from chk[l] to chk[l + 1] in \(\mathbb {G}\), i.e., \(ch_k[l][\mathbb {PT}] \cap ch_k[l+1][\mathbb {ST}] = \lbrace T_k[l]\rbrace \ne \emptyset\). The first callback chk[1] in a chain is typically a timer callback that performs sensor data acquisition. The output of the chain is published on a topic \(T_k[\mu _k] \in ch_k[\mu _k][\mathbb {PT}]\) from which the actuator reads the data.
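The chain condition above, i.e., consecutive callbacks connected via a shared topic, can be checked mechanically. A small sketch of ours with hypothetical callback and topic names:

```python
# A sequence of callbacks forms a chain iff, for every consecutive pair,
# the predecessor publishes on a topic that the successor subscribes to.
def is_chain(callbacks, seq):
    return all(callbacks[a]['PT'] & callbacks[b]['ST']
               for a, b in zip(seq, seq[1:]))

cbs = {
    'cb_sense': {'PT': {'/t1'},  'ST': set()},    # timer callback, e.g., sensor read
    'cb_plan':  {'PT': {'/t2'},  'ST': {'/t1'}},
    'cb_act':   {'PT': {'/cmd'}, 'ST': {'/t2'}},  # publishes the chain's output
}
```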
We are interested in minimizing the jitters Jk in the end-to-end latency Lk of a critical chain chk. For a particular chained execution of callbacks in chk, Lk is the time between the start of chk[1] and the instant when chk[μk] publishes Tk[μk]. We denote the maximum and the minimum observed end-to-end latency of chk by \(\overline{L}_k\) and \(\underline{L}_k\), respectively. We define the jitters Jk as \(\overline{L}_k - \underline{L}_k\), following [12].
For the purpose of scheduling the chain, we are further concerned about the execution and response times of its callbacks. We define the execution time ek[l] as the time required to run by an instance of the callback chk[l], while we denote its maximum observed value by \(\overline{e}_k[l]\). The response time rk[l] of chk[l] is the time between its start and end for a single run and its maximum observed value is denoted by \(\overline{r}_k[l]\).
We understand that a critical chain chk is part of the DAG, because of which it experiences certain unavoidable interference, and our methodology does not exclude such possibilities. (i) The node of a callback chk[l], chk[l][Nd] = Ni, may comprise other callbacks that are not part of chk, i.e., Γi∖chk ≠ ∅. Considering that \(ch_k[l][Nd][\mathcal {T}_{ex}]\) will also run these callbacks besides chk[l], they can delay the execution of chk[l] and increase Lk. (ii) If chk[l] publishes on a topic \(T \in ch_k[l][\mathbb {PT}]\) that is not subscribed to by chk[l + 1], the time spent in computing and publishing the data on T will add to \(e_k[l]\), thereby increasing Lk. (iii) If chk[l] is a synchronizing callback, it might not run even after receiving the data on Tk[l − 1] published by chk[l − 1] and instead wait for data on other synchronized topics in \(ch_k[l][\mathbb {ST}] \setminus \lbrace T_k[l-1]\rbrace\). This again increases Lk.
We emphasize that to manage the timings of a chain chk, we only consider (i) choosing an OS scheduling policy and (ii) configuring the scheduling parameters of the executor threads in \(ch_k[\mathbb {T}_{ex}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}]| 1\le l \le \mu _k\rbrace\) and the DDS threads in \(ch_k[\mathbb {T}_{DDS}] = \cup _{l=1}^{\mu _k} ch_k[l][Nd][\mathbb {T}_{DDS}]\). This is consistent with the timing engineering process generally followed in the industry, during which changes to the application code and the software platform (including middleware libraries) are not easily possible.
2.4 Profiling-based timing model extraction
As highlighted in the survey results in [2], measurement-based timing analysis is the more popular practice in the industry. Hence, as a first work on jitter control of ROS2 computation chains, we focus on profiling their execution in the target environment covering different scenarios. In the process, we obtain (i) \(\overline{e}_k[l]\) and \(\overline{r}_k[l]\) for a callback chk[l] in chk and (ii) \(\overline{L}_k\) and \(\underline{L}_k\) for chk. Although this paper shows how our methods use measurement data, they can also be applied using analysis results obtained, e.g., by applying a combination of WCET, worst-case response time (WCRT), and end-to-end analysis techniques. However, to the best of our knowledge, such a full-fledged analytical framework is not available for industry use. WCET analysis tools either do not scale to real-world workloads, e.g., related to autonomous functionalities, or provide very pessimistic results. Also, table-driven reservation-based scheduling of ROS2 applications and the related WCRT and end-to-end latency analyses have not been studied so far.
A few works in recent years discuss the measurement of the end-to-end latency of ROS2 computation chains. (i) [28] and [45] directly instrument the application code and record timestamps. (ii) Autoware_Perf [30] and CARET [29] extend ros2_tracing [6] to add trace points in ROS2 and measure the end-to-end latency from collected traces. We follow the second approach and use [1]—a ROS2 tracing framework based on extended Berkeley Packet Filter (eBPF). It already offers trace-processing tools to (i) construct DAG representations of ROS2 applications and (ii) measure execution and response times of callbacks. Further, we extend it with our end-to-end latency measurement tool, which works as described below.
For each execution of a chain chk, our tool traverses the events related to each callback chk[l] in chk, as illustrated in Figure 2. For each callback chk[l] except the first one, we find the following events in order: callback_start, data_read, data_write, and callback_end. We do not get any data_read event for chk[1] because we assume that it is a timer callback that acquires the sensor data directly via device drivers. We move from chk[l] to chk[l + 1] in chk by matching the topic name and the source timestamp of the data in the data_write event of chk[l] with the ones in the data_read event of chk[l + 1]. Here, the source timestamp is used to uniquely identify a data item across the publisher and subscriber callbacks. We save the start time of chk[1] and, after traversing the chain, we find the time when chk[μk] publishes the chain’s output on Tk[μk]. We can measure Lk as the difference between the two time instants. We shall note that the measured values of Lk already subsume the interference experienced by the chain, which is key to understanding our proposed mechanism.
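A much-simplified sketch of this traversal for a single chain execution (our illustration; the real tool works on traces collected by the eBPF-based framework, and the event and field names here are hypothetical stand-ins):

```python
def end_to_end_latency(trace, chain):
    """trace: time-ordered event dicts with keys 'event', 'callback', 'time',
    plus 'topic' and 'stamp' (source timestamp) for data_read/data_write.
    chain: ordered callback names ch_k[1..mu_k]."""
    start = next(e['time'] for e in trace
                 if e['event'] == 'callback_start' and e['callback'] == chain[0])
    for cur, nxt in zip(chain, chain[1:]):
        w = next(e for e in trace
                 if e['event'] == 'data_write' and e['callback'] == cur)
        # The source timestamp identifies the same data item at the subscriber;
        # raises StopIteration if no matching data_read exists.
        next(e for e in trace
             if e['event'] == 'data_read' and e['callback'] == nxt
             and e['topic'] == w['topic'] and e['stamp'] == w['stamp'])
    out = next(e for e in trace
               if e['event'] == 'data_write' and e['callback'] == chain[-1])
    return out['time'] - start

trace = [
    {'event': 'callback_start', 'callback': 'cb1', 'time': 0.0},
    {'event': 'data_write', 'callback': 'cb1', 'time': 5.0,
     'topic': '/t1', 'stamp': 100},
    {'event': 'callback_start', 'callback': 'cb2', 'time': 6.0},
    {'event': 'data_read', 'callback': 'cb2', 'time': 6.0,
     'topic': '/t1', 'stamp': 100},
    {'event': 'data_write', 'callback': 'cb2', 'time': 12.0,
     'topic': '/t2', 'stamp': 101},
]
L1 = end_to_end_latency(trace, ['cb1', 'cb2'])   # 12.0
```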
2.5 Priority- vs time-driven chain execution
This section demonstrates the timing behavior of a high-critical ROS2 computation chain ch1 when implemented using two standard real-time scheduling policies, i.e., fixed-priority and time-triggered scheduling. Here, ch1 spans five ROS2 nodes {N1, N2, N3, N4, N5} where each node Ni comprises one callback cbi, 1. Further, ch1 starts with a timer callback ch1[1] = cb1, 1 in N1 that runs every 100 ms and publishes data on a topic T1[1]. Each of the other nodes runs a subscriber callback ch1[l] = cbl, 1 that reads data from topic T1[l − 1], runs computation load, and then publishes data on T1[l]. The chain’s output is, hence, published on T1[5]. For each callback ch1[l], the execution time e1[l] varies uniformly between 5 ms and 15 ms. We run the executor threads \(ch_1[\mathbb {T}_{ex}]\) on one processor core. Additionally, we create a node N6 and run \(N_6[\mathcal {T}_{ex}]\) on the same core. It runs a timer callback cb6, 1 every 100 ms that interferes with ch1. We vary the execution time of cb6, 1 between int and \(\frac{int}{2}\) where int ∈ {10, 20, 30, 40} ms. We run the DDS threads \(ch_1[\mathbb {T}_{DDS}] \cup N_6[\mathbb {T}_{DDS}]\) on another processor core. We use two schedule configurations as follows:
(1)
Using \(\mathbb {S}_1\), we assign (i) a high priority to each executor thread in \(ch_1[\mathbb {T}_{ex}]\) and (ii) a low priority to \(N_6[\mathcal {T}_{ex}]\).
(2)
Using \(\mathbb {S}_2\), we create five high-priority servers R1, R2, R3, R4, and R5 with the time slots (in ms) sl1, 1 = [0, 16), sl2, 1 = [16, 32), sl3, 1 = [32, 48), sl4, 1 = [48, 64), and sl5, 1 = [64, 80), respectively. Further, we create a low-priority server R6 with the time slot (in ms) sl6, 1 = [0, 100). For 1 ≤ i ≤ 6, we assign Ri[CT] = 100 ms and \(R_i[\mathbb {T}]=\lbrace N_i[\mathcal {T}_{ex}]\rbrace\).
In both configurations, (1) and (2), we run the DDS threads using \(\mathbb {S}_1\) while assigning a high priority to \(\mathcal {T} \in ch_1[\mathbb {T}_{DDS}]\) and a low priority to \(\mathcal {T} \in N_6[\mathbb {T}_{DDS}]\).
We measure \(\underline{L}_1\), \(\overline{L}_1\), and J1. Table 1 shows the variation in \(\overline{L}_{1}\) and J1 with int using (1) and (2). We have the following observations:
•
Using (1), we obtain lower values of \(\overline{L}_1\). However, we get higher values of J1, which are mainly caused by the varying execution times—common in real-world applications—of the chain’s callbacks. Hence, using priority-driven scheduling, it is challenging to control jitters in a chain when it has time-varying workloads.
•
Compared to (1), \(\overline{L}_1\) increases using (2). This is due to the over-provisioned time slots {sli, 1|1 ≤ i ≤ 5} that are tuned based on the maximum observed response times \(\lbrace \overline{r}_1[i]|1\le i \le 5\rbrace\) of the callbacks in ch1. That is, until sli, 1 comes, ch1[i] cannot start even when ch1[i − 1] has finished execution. Nevertheless, due to such constrained progress in the chain’s execution, J1 reduces to reflect only the response time variation of the chain’s last callback. Hence, using time-triggered scheduling as in (2), jitters reduce at the expense of an increased maximum end-to-end latency.
•
Using (1), we see a slight increase in \(\overline{L}_{1}\) with increasing int. This is because \(\mathbb {S}_1\) is a practical implementation of fixed-priority preemptive scheduling with non-negligible preemption costs. Such costs are also present in (2) but are not reflected in \(\overline{L}_{1}\) due to the over-provisioned slots (1 ms longer) for running the callbacks in ch1.
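The qualitative trend in these observations can be reproduced with a toy simulation (ours, not the paper's testbed): the slot layout follows configuration (2) and the execution-time range follows the setup above, while everything else is simplified away (no interference, zero communication and preemption costs).

```python
import random

random.seed(0)
SLOT_STARTS = [0, 16, 32, 48, 64]        # starts of sl_{1,1}..sl_{5,1} in ms

def chain_latency(exec_times, time_triggered):
    t = 0.0
    for i, e in enumerate(exec_times):
        if time_triggered:
            t = max(t, SLOT_STARTS[i])   # callback i must wait for its slot
        t += e                           # then runs to completion
    return t

samples = [[random.uniform(5, 15) for _ in range(5)] for _ in range(1000)]
prio = [chain_latency(s, time_triggered=False) for s in samples]
tt   = [chain_latency(s, time_triggered=True)  for s in samples]

J_prio = max(prio) - min(prio)   # jitter accumulates over all five callbacks
J_tt   = max(tt) - min(tt)       # jitter reflects only the last callback
# With slots provisioned for the worst case, the latency under (2) is higher
# but nearly constant: J_tt stays below 10 ms, J_prio is several times larger.
```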
Given the above observations, our goal is to get the best of the priority-driven and time-triggered scheduling schemes. We aim to eliminate jitters in the chain while its maximum end-to-end latency remains comparable to what we get from its high-priority execution.
3 End-to-End Latency Management
This section shows how our proposed mechanism uses table-driven reservation-based scheduling to manage the end-to-end latency of a jitter-sensitive chain while considering ROS2 semantics. Notably, table-driven reservations are effective in implementing certifiable safety-critical systems [46].
3.1 Controlling maximum end-to-end latency
Let us implement a critical chain chk similar to (1) in Section 2.5. We run \(\mathcal {T} \in ch_k[\mathbb {T}_{ex}]\) and \(\mathcal {T} \in ch_k[\mathbb {T}_{DDS}]\) using \(\mathbb {S}_1\) and a high priority. This follows the industry practice of prioritizing critical workloads to minimize the interference experienced by them. We profile such a high-priority execution of chk and measure its maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_1)\) and jitters \(J_k(\mathbb {S}_1)\). We note that \(\overline{L}_k(\mathbb {S}_1)\) subsumes the unavoidable interference experienced by chk, as explained in Section 2.3. We do not consider co-scheduling multiple critical chains.
Following our goal to keep the maximum end-to-end latency close to \(\overline{L}_k(\mathbb {S}_1)\), we define a server Rm. Without loss of generality, we configure Rm with a time slot slm, 1 where:

\(st_{m,1} = 0 \quad \text{and} \quad et_{m,1} = \overline{L}_k(\mathbb {S}_1) + \epsilon, \qquad (1)\)

with ε being a small provisioning slack. We can also shift slm, 1 by an offset while co-optimizing the timing performance of multiple chains; however, this is not the focus of this paper.
The cycle time Rm[CT] of Rm is equal to the period of the chain execution. We consider that \(\overline{L}_k(\mathbb {S}_1) \lt R_m[CT]\), i.e., one execution of chk can be completed within a cycle. Further, we assume that the maximum processor utilization contributed by the threads in \(ch_k[\mathbb {T}_{ex}]\) is less than 100%, i.e., one CPU core can run all threads in \(ch_k[\mathbb {T}_{ex}]\) without any overload. In this paper, we study this simple case. However, when \(\overline{L}_k(\mathbb {S}_1) \gt R_m[CT]\), our method can be extended to cover two or more processor cores and define servers on each core for a pipelined execution of the chain. The placement of time slots in that case needs to consider profiles of sub-chains, which we leave as future work.
We assign the threads running the callbacks in chk to Rm, i.e., \(R_m[\mathbb {T}] = ch_k[\mathbb {T}_{ex}]\). At the beginning of slm, 1 in a cycle, chk[1] runs. Thereafter, chk[l] can start whenever it has received the data from chk[l − 1]—similar to using \(\mathbb {S}_1\) and a high priority. Hence, chk will finish execution as soon as possible, and the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) using this technique will be approximately equal to \(\overline{L}_k(\mathbb {S}_1)\). We assign a high priority to Rm, e.g., Rm[Pr] = 1 using LITMUSRT. To improve the utilization of the processor core when chk does not run until \(\overline{L}_k(\mathbb {S}_2)\), we can define another server \(R_{m^{\prime }}\) with \(R_{m^{\prime }}[Pr] \gt R_m[Pr]\) and map threads performing best-effort tasks to it. In that case, Rm basically isolates chk from the interference by the best-effort tasks, and we still get \(\overline{L}_k(\mathbb {S}_2) \approx \overline{L}_k(\mathbb {S}_1)\).
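The dimensioning rule above can be sketched as follows (our illustration; the slack value eps and all numbers are hypothetical, and the actual server creation is done via LITMUSRT tools):

```python
# R_m gets one slot per cycle, starting at 0 and sized by the maximum
# end-to-end latency measured under high-priority execution plus a small
# hypothetical slack; the cycle time equals the chain's period.
def dimension_server(L_max_S1, period, eps=1.0, prio=1):
    assert L_max_S1 + eps < period, "one chain execution must fit in a cycle"
    return {'slots': [(0.0, L_max_S1 + eps)], 'CT': period, 'Pr': prio}

R_m = dimension_server(L_max_S1=62.5, period=100.0)   # slot [0, 63.5)
# A lower-priority server with an overlapping slot can host best-effort
# threads without adding interference to the chain:
R_prime = {'slots': [(0.0, 100.0)], 'CT': 100.0, 'Pr': 2}
```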
Experimentally, we have also studied the DDS communication threads in a ROS2 node. We have observed that a few of them—depending on the DDS implementation we use—influence the end-to-end latency by delaying the communication between the chain’s callbacks. Hence, we must isolate these threads from interference by best-effort workloads. Further, we have observed that while they run during data send and receive, they also run at other instants performing protocol-related tasks (e.g., sending heartbeat signals and polling message queues). Hence, we cannot assign them to servers with specific time slots without delaying an important DDS-related task significantly (e.g., by tens of milliseconds). Also, they typically run for a short duration (several hundred microseconds) when they wake up. They mostly run in parallel to the ROS2 threads executing the callbacks. Considering the above observations related to the DDS threads, we bind the threads in \(ch_k[\mathbb {T}_{DDS}]\) to a different processor core and use \(\mathbb {S}_1\) to schedule them with a high priority. This ensures that the DDS communication in chk will have a minimal latency.
3.2 Controlling jitters
When we schedule chk using a high-priority server Rm as explained above, we can minimize the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\). Now, to achieve the goal of reducing the jitters, we study two mechanisms for publishing the chain’s output, namely asynchronous and synchronous publish, as described in Section 2.2.
3.2.1 Asynchronous publish.
When the last callback chk[μk] in chk writes its output using this mechanism, the publisher thread \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) publishes the data to its subscribers. We propose to execute \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) in a server \(R_{m^\circ }\), as illustrated in Figure 3 Design-1. We configure \(R_{m^\circ }\) with a time slot \(sl_{m^\circ ,1}\) given by:

\(st_{m^\circ ,1} = et_{m,1} \quad \text{and} \quad et_{m^\circ ,1} = et_{m,1} + \overline{e}_k[pub].\)

Here, etm, 1 marks the end of slm, 1 in Rm (see Equation 1) and \(\overline{e}_k[pub]\) is the maximum time required to publish chk’s output. \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) can be woken up any time within slm, 1, but it cannot run until it gets the time slot \(sl_{m^\circ ,1}\). Hence, chk’s output is available to its subscribers only during \(sl_{m^\circ ,1}\). That is, the end-to-end latency \(L_k(\mathbb {S}_2)\) of chk varies between \(\overline{L}_k(\mathbb {S}_1) + \epsilon\) and \(\overline{L}_k(\mathbb {S}_1) + \epsilon + \overline{e}_k[pub]\). We have observed that \(\overline{e}_k[pub] \lt 1\:\text{ms}\), which results in \(J_k(\mathbb {S}_2) \lt 1\:\)ms. We can further configure \(R_{m^\circ }\) such that:
We term the above idea latency shaping. Typically, a task is either event- or time-triggered. However, latency shaping requires a time- and event-triggered task to publish the chain’s output. That is, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) is triggered first when chk[μk] invokes an appropriate DDS API to publish data, and later, at the time instant when \(sl_{m^\circ ,1}\) starts, it is dispatched. Here, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) cannot run until both conditions are fulfilled.
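A toy calculation (ours) of the effect described in this subsection: the chain finishes at a varying instant within slm, 1, but the output becomes visible only in \(sl_{m^\circ ,1}\), so the end-to-end latency is pinned. All numbers are hypothetical.

```python
ET_M1 = 63.5   # end of sl_{m,1} = start of sl_{m°,1}, in ms (hypothetical)
E_PUB = 0.5    # maximum observed publish time e_k[pub], in ms (hypothetical)

def output_time(chain_finish):
    """The publisher thread may be woken any time within sl_{m,1}, but it is
    dispatched only when sl_{m°,1} starts; the publish time is modeled at its
    maximum, so the output instant is identical for every execution."""
    assert chain_finish <= ET_M1, "chain must finish within sl_{m,1}"
    return ET_M1 + E_PUB

# Three executions finishing at very different times all yield the same
# end-to-end latency of 64.0 ms -> the jitter collapses.
latencies = [output_time(f) for f in (40.2, 55.7, 62.9)]
```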
3.2.2 Synchronous publish.
Using this mechanism, chk[μk] directly publishes chk’s output, i.e., \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) runs chk[μk] as well as publishes data on Tk[μk]. We study only static allocation of threads to servers. Now, if we keep \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) in Rm, we end up getting the same jitters \(J_k(\mathbb {S}_1)\) as with the high-priority execution of chk. Alternatively, we can use two servers, Rm and \(R_{m^\circ }\), as in Section 3.2.1, where the threads in \(ch_k[\mathbb {T}_{ex}] \setminus \lbrace ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\rbrace\) run using Rm and \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) uses \(R_{m^\circ }\). Here, we can configure slm, 1 based on the measured maximum end-to-end latency \(\overline{L}_{k-}(\mathbb {S}_1)\)—during a high-priority execution—of the sub-chain chk − comprising {chk[l] | 1 ≤ l ≤ μk − 1}, while \(sl_{m^\circ ,1}\) shall be longer than the maximum response time \(\overline{r}_k[\mu _k]\) of chk[μk]. Further, slm, 1 is immediately followed by \(sl_{m^\circ ,1}\) in time. However, in this design, the execution time variation in chk[μk] still contributes to jitters in chk, which may not be acceptable.
Hence, we propose to add a latency-shaping subscriber callback \(ch_k[\mu _k+1]\) inside a ROS2 node \(ch_k[\mu _k+1][Nd]\), as shown in Figure 3 Design-2. We change the name of the topic \(T_k[\mu _k]\) on which \(ch_k[\mu _k]\) publishes. Here, we do not modify or recompile the application code, which is a crucial consideration as the application sources are often not available to a systems engineer in the industry. In fact, we can just remap the topic name of \(T_k[\mu _k]\) in the launch file from, e.g., \(T_k[\mu _k][Nm]\) to \(T_k[\mu _k][Nm]\_tmp\). We implement \(ch_k[\mu _k+1]\) to subscribe to and read data from \(T_k[\mu _k]\) (i.e., the subscribed topic name must be \(T_k[\mu _k][Nm]\_tmp\)) and publish the same data to \(T_k[\mu _k+1]\), i.e., it republishes \(ch_k\)'s output. We further ensure that \(T_k[\mu _k+1][Nm] = T_k[\mu _k][Nm]\), i.e., it has the original name of \(T_k[\mu _k]\). That is, the subscribers of \(ch_k\)'s output will now read data from \(T_k[\mu _k+1]\). This essentially extends the chain \(ch_k\) to \((ch_k[1], \ldots , ch_k[\mu _k], ch_k[\mu _k+1])\), considering that the end of its execution is identified when its output is available.
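As a sketch, such a remapping could look as follows in a ROS2 Python launch file; the package, executable, and topic names (app_pkg, last_callback_node, chain_output, etc.) are placeholders of ours, not identifiers from this work.

```python
# Hypothetical launch file: redirect the producer's output topic so that
# the latency-shaping node can take over the original topic name.
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Unmodified application node: its publications on 'chain_output'
        # are transparently redirected to 'chain_output_tmp'.
        Node(
            package='app_pkg', executable='last_callback_node',
            remappings=[('chain_output', 'chain_output_tmp')],
        ),
        # Latency-shaping node: subscribes to 'chain_output_tmp' and
        # republishes each sample on the original name 'chain_output'.
        Node(
            package='shaping_pkg', executable='latency_shaper',
        ),
    ])
```

Because the remapping happens in the launch configuration, the application binaries stay untouched, matching the no-recompilation requirement above.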
Now, we run this extended chain using two servers, \(R_m\) and \(R_{m^\circ }\), that execute \(R_m[\mathbb {T}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}] \mid 1 \le l \le \mu _k\rbrace\) and \(R_{m^\circ }[\mathbb {T}] = \lbrace ch_k[\mu _k+1][Nd][\mathcal {T}_{ex}]\rbrace\), respectively. While \(R_m\) shall comprise a slot \(sl_{m,1}\) as per Equation 1, \(R_{m^\circ }\) shall comprise a slot given by:
where \(\overline{r}_k[\mu _k+1]\) is the maximum observed response time of \(ch_k[\mu _k+1]\) to execute the logic for republishing \(ch_k[\mu _k]\)'s output. Here, \(ch_k[\mu _k+1]\) only republishes a data item and does not perform any computation; thus, \(\overline{r}_k[\mu _k+1]\) is short and impacts the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) negligibly. Also, the time to republish should be fairly constant and, hence, the jitters \(J_k(\mathbb {S}_2)\) are negligible.
Additionally, we note that \(ch_k[\mu _k+1][Nd][\mathbb {T}_{DDS}]\) shall be scheduled using \(\mathbb {S}_1\) and a high priority to avoid communication delays between \(ch_k[\mu _k]\) and \(ch_k[\mu _k+1]\).
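To make the shaping effect concrete, the following toy model (all numbers and names are ours for illustration, not measured data) shows how dispatching the republisher only at the start of \(sl_{m^\circ ,1}\) collapses widely varying finish times into one short output band:

```python
def shaped_output_time(finish_time, slot_start, repub_time):
    """Time at which the chain's output becomes visible when the
    republisher thread runs only in a slot starting at slot_start.
    Illustrative model, not the paper's formal analysis."""
    # The republisher is woken by the arriving sample but dispatched no
    # earlier than the slot start; republishing itself takes repub_time.
    return max(finish_time, slot_start) + repub_time

# Sub-chain finish times vary widely across executions (within sl_m1)...
finishes = [3.1, 5.7, 8.4]   # ms
slot_start = 9.0             # ms, start of the slot in R_m°
repub = 0.05                 # ms, assumed maximum republish time
outputs = [shaped_output_time(f, slot_start, repub) for f in finishes]
# ...yet the output instants all land at slot_start + repub.
jitter = max(outputs) - min(outputs)
```

With the slot placed after every observed finish time, the residual jitter is bounded by the (small, nearly constant) republish time rather than by the chain's execution-time variation.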
3.3 Multiple latency bands
So far, we have obtained one short band in which the end-to-end latency lies for each chain execution. We can produce more such bands by configuring \(R_{m^\circ }\) with multiple non-overlapping time slots, as shown in Figure 3 Design-3. The length \(\Delta _{m^\circ }\) of each time slot is calculated from Equation 2 or 4 based on how the chain's output is published. For example, \(\Delta _{m^\circ } = \overline{r}_k[\mu _k+1]\) if \(ch_k[\mu _k]\) publishes synchronously. We place the time slots \((sl_{m^\circ ,1}, sl_{m^\circ ,2}, \ldots , sl_{m^\circ ,\nu _{m^\circ }})\) in chronological order, where \(sl_{m^\circ ,\alpha } = [x_{m^\circ ,\alpha } - \Delta _{m^\circ }, x_{m^\circ ,\alpha })\). We consider \(x_{m^\circ ,1} - \Delta _{m^\circ } \gt \underline{L}_k(\mathbb {S}_1)\), where \(\underline{L}_k(\mathbb {S}_1)\) is the observed minimum end-to-end latency of \(ch_k\) during its high-priority execution. Otherwise, \(ch_k\) may never publish its output in \(sl_{m^\circ ,1}\). We place \(sl_{m^\circ ,\nu _{m^\circ }}\) in the same position as in our single-band solution (i.e., following Equation 2 or 4). Hence, we do not change the maximum end-to-end latency of \(ch_k\).
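The slot placement rules above can be checked mechanically. The helper below builds the slots \([x_{m^\circ ,\alpha } - \Delta _{m^\circ }, x_{m^\circ ,\alpha })\) and enforces the two conditions (first slot opens after the minimum latency; slots do not overlap); the concrete millisecond values are invented for illustration.

```python
def place_slots(x_positions, delta, min_latency):
    """Build the multi-band slots sl_{m°,a} = [x_a - delta, x_a) and
    validate the placement conditions of the multi-band design.
    Sketch under assumed units (ms); not the paper's implementation."""
    slots = [(x - delta, x) for x in sorted(x_positions)]
    # The first slot must open after the minimum observed latency,
    # otherwise the chain might never publish its output in sl_{m°,1}.
    if slots[0][0] <= min_latency:
        raise ValueError("first slot opens before the chain can finish")
    # Slots must be non-overlapping.
    for (_, prev_end), (next_start, _) in zip(slots, slots[1:]):
        if next_start < prev_end:
            raise ValueError("slots overlap")
    return slots

# Three latency bands within one cycle; delta per Equation 2 or 4 (0.5 ms
# assumed), minimum observed latency 4 ms assumed.
slots = place_slots([12.0, 20.0, 28.0], delta=0.5, min_latency=4.0)
```

The last slot is kept at the single-band position, so adding earlier bands never worsens the maximum end-to-end latency.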
Such a configuration of \(R_{m^\circ }\) supports a multi-mode implementation of a chain \(ch_k\) where, in each mode, it uses a specific slot in the cycle to publish its output. That is, we get negligible jitters in the end-to-end latency in each operating mode of \(ch_k\). An autonomous system, e.g., a car, operates in different environments or modes, producing largely varying timing behavior of critical computation chains. For example, in heavy traffic, the end-to-end latency from LIDAR/camera to steering and speed control might be long because of a long computation time in object detection and tracking. System-level performance can improve if specific latency-aware control logic is designed per mode [37, 38], and our design supports that.
We foresee that mode-switch conditions are statically defined. We consider that the active mode can be read from a topic by \(ch_k[\mu _k+1][Nd]\). It can use this knowledge to ensure that it publishes in an appropriate slot to maintain the end-to-end latency of the mode. Here, we need to put additional logic in \(ch_k[\mu _k+1][Nd]\); a prospective implementation is briefly discussed in Section 6. The important point is that, in our design, we do not need to change the server configurations online during a mode switch. We note that co-designing a multi-mode application logic and its timing design following multi-band latency shaping is future work.
6 Future Considerations
In our problem setting, as explained in Section 2.3, a critical chain may experience interference from a non-critical callback run by an executor thread that also runs a callback in the chain. For example, in Section 5.1, the Throttler node \(N_{Th}\) has two callbacks, \(cb_r^{Th}\) and \(cb_f^{Th}\), while only \(cb_r^{Th}\) is in the critical chain \(ch_{rl}\). Both are run by \(N_{Th}[\mathcal {T}_{ex}]\) in \(R_{Th}\), and \(R_{Th}[Sl]\) is designed to sufficiently accommodate both callbacks every cycle. For more precise control over each callback's execution in a node, we suggest developing the node using multiple mutually exclusive callback groups and assigning each group to a different executor using the add_callback_group(…) function [36].
Further, this paper does not study the case when the chain needs to run longer than the slot duration in \(R_m\), which will delay the next chain execution. We can adapt our design to eliminate such delays when the chain's worst-case end-to-end latency is shorter than its period. That is, we can extend \(sl_{m,1}\) to cover the entire period so that the chain will finish its computation within a period. However, the chain's output is not published in the same period if the corresponding \(sl_{m^\circ ,1}\) is missed. We can add an appropriate liveliness QoS for \(T_k[\mu _k]\) so that it is not processed in the next period. In this case, we discard a chain execution if it overruns the estimated maximum end-to-end latency. Several works [14, 24, 47, 48] have analyzed system safety—in terms of control stability and performance—when control updates may be skipped. In Section 3.1, we also commented on the case where the worst-case end-to-end latency is longer than the period.
While we have demonstrated how latency shaping can be combined with DAG scheduling, a better formalization of the application model and solution is future work. Also, when an executor thread runs multiple callbacks in a DAG and shall be statically bound to a core, partitioned DAG scheduling must consider that these callbacks run on one processing unit, which is similar to [42].
While we can support multi-mode application design, we have not formalized the logic inside a latency-shaping node \(ch_k[\mu _k+1][Nd]\) to ensure a constant end-to-end latency in a mode. First, \(ch_k[\mu _k+1]\) must be aware of the active mode; hence, another callback in \(ch_k[\mu _k+1][Nd]\) subscribes to the corresponding topic and passes the information via a shared variable. Further, \(ch_k[\mu _k+1]\) must track in which slot it is running, which is possible using a timer callback in \(ch_k[\mu _k+1][Nd]\) that updates a counter in each slot in \(R_{m^\circ }\) while sharing the value with \(ch_k[\mu _k+1]\). By being aware of the active mode and slot, \(ch_k[\mu _k+1]\) can decide appropriately to publish or discard the output, or to wait for the next slot. Here, we do not dynamically update \(R_{m^\circ }\).
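A minimal sketch of this prospective logic follows; the class, method names, and mode-to-slot mapping are our assumptions (framework calls are abstracted behind plain methods), not a formalized design.

```python
class ModeAwareShaper:
    """Sketch of the extra logic in the latency-shaping node: a subscriber
    callback tracks the active mode, a timer callback counts the slots of
    R_m° within a cycle, and the republish callback forwards the chain's
    output only in the slot assigned to the active mode."""

    def __init__(self, num_slots, slot_of_mode, publish_fn):
        self.num_slots = num_slots        # slots per cycle in R_m°
        self.slot_of_mode = slot_of_mode  # mode -> slot index to publish in
        self._publish_fn = publish_fn
        self.mode = None
        self.slot = -1                    # no slot active yet
        self._pending = None

    def on_mode_msg(self, mode):
        # Subscriber callback on the mode topic (shared-variable handover).
        self.mode = mode

    def on_slot_timer(self):
        # Timer callback fired at the start of each slot of R_m°.
        self.slot = (self.slot + 1) % self.num_slots
        self._maybe_publish()

    def on_chain_output(self, msg):
        # Republish callback: hold the sample until the right slot.
        self._pending = msg
        self._maybe_publish()

    def _maybe_publish(self):
        # Publish only in the active mode's slot; otherwise keep waiting.
        # (A variant could instead discard stale output here.)
        if self._pending is not None and self.slot == self.slot_of_mode.get(self.mode):
            self._publish_fn(self._pending)
            self._pending = None
```

Since the slot schedule of \(R_{m^\circ }\) is fixed, only this decision logic changes with the mode, which is why no online server reconfiguration is needed.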
8 Concluding Remarks
This paper introduces the concept of latency shaping, which primarily enables a low-jitter implementation of a ROS2 computation chain. Compared to LET, it reduces pessimism as well as supports a multi-mode implementation of a chain. It is more practical because it uses the profiling results of the chain instead of relying on analytical frameworks. We further show how the concept can be implemented considering different mechanisms to publish data in ROS2. An important aspect of our proposed idea is that it does not require modifying or recompiling the application code, which is a crucial requirement to preserve the separation of concerns between application development and timing engineering in the industry.
While we have considered static allocation of threads to reservation servers in this work, in the future, we intend to explore dynamic allocation for the last callback in the chain so that we can split the computation and data-publish tasks of the thread into two servers, thereby eliminating the use of an additional latency-shaping callback. It would also be interesting to explore the co-design of control logic and latency shaping. While we have considered computation chains over ROS2, DDS, and Linux, we believe that the idea is generic and can be applied to other middlewares and operating systems as well.