Managing End-to-End Timing Jitters in ROS2 Computation Chains

Published: 03 January 2025

Abstract

Typically, in a cyber-physical system (CPS), timing jitters between sensing and actuation adversely affect its physical behavior. The logical execution time (LET) paradigm has gained industry attention because it offers zero jitter in tasks' response times. In autonomous CPS such as self-driving cars, Robot Operating System 2 (ROS2) is becoming a popular software platform to implement computation tasks. Towards a practical LET implementation of ROS2 computation chains, we propose to use at least two table-driven reservation servers: (i) one exclusively runs the thread that publishes the chain's output, while (ii) the others isolate the chain's main computations from best-effort workloads on the same processing unit. We show how to architecturally adapt the chain non-intrusively as well as how to dimension the servers and allocate the appropriate threads to them so that we obtain negligible jitters while keeping the observed maximum end-to-end latency comparable to a high-priority execution of the chain. Our approach is also versatile and can produce latency bands, thereby offering an opportunity to co-optimize jitters and the average end-to-end latency. This further supports multi-mode application design, which is especially important for high-performance operation in different environments, e.g., city and highway driving. Our idea does not involve modification or recompilation of the application code or the ROS2 libraries, which is crucial in industry settings. We have also developed tools to automate our approach. By applying our proposed mechanism to a real-world benchmark implementing Lidar-based localization, we maintain a constant end-to-end latency that is even 13% shorter than an improved LET implementation.

1 Introduction

The advancement in artificial intelligence (AI) and computer vision (CV) is driving cyber-physical systems (CPS) towards autonomy, e.g., self-driving cars, drones, and robots. Robot Operating System 2 (ROS2) is becoming pivotal in this drive due to (i) its vast repertoire of open-source implementations of AI and CV algorithms, (ii) its portability to different operating systems (OS), and (iii) its support for collaborative development of complex software applications.
Software applications over ROS2 consist of modular functional components called nodes. A node comprises callbacks to handle timer and communication events. The nodes send and receive data using topics following the Data Distribution Service (DDS) standard. An application can contain computation chains formed by a sequential invocation of callbacks connected via topics. In autonomous systems, a chain reads and processes sensor data, performs planning, executes control logic, and applies actuation signals [20].
Figure 1: Latency shaping vs conventional logical execution time implementation.
A long and unpredictable end-to-end latency of such a chain can lead to lower performance or even to unsafe physical behavior [31, 39]. Static and dynamic priority-based scheduling have been studied for ROS2 to reduce the end-to-end latency, e.g., [3, 9, 15, 50]. However, none of the existing works considers minimizing the jitters, i.e., the variations, in the end-to-end latency of ROS2 chains.
Typically, jitters are eliminated using techniques [7, 18, 21, 22, 33, 34] following the Logical Execution Time (LET) concept [25]—the de-facto industry standard, e.g., see AUTOSAR [4]. Figure 1a shows a conventional LET implementation where, at the beginning of a period, a task first writes its output from the previous execution, then reads the new input and starts processing it. Hence, a task spends a constant time—equal to its period—between data read and write. Consequently, a chain has a constant end-to-end latency. Nevertheless, existing LET implementations are not compatible with ROS2 chains for the following reasons: (i) They require controlling the timings of reading from and writing to topics, which is non-trivial with the widely used DDS communication mechanism. (ii) They require the callbacks to be time-triggered, which is not the case when one subscribes to a topic.
Proposed solution: Towards a practical LET implementation of a ROS2 chain, we use table-driven reservation servers. Such a server reserves specific time slots on a processing unit to run its assigned threads. We execute a high-critical (sub-)chain using high-priority servers and run best-effort workloads in low-priority servers, i.e., we consider hierarchical scheduling. This offers high real-time performance as well as good processor utilization. Unlike conventional time-triggered scheduling, where computation time is reserved for each task based on its worst-case execution time (WCET), we assign chained callbacks to a server Rm that is capable of running multiple threads. We dimension Rm based on the maximum measured end-to-end latency \(\overline{L}\) of the assigned chain during a priority-driven execution. Such measurement-based approaches are popular in the industry [2].
Further, we study two DDS communication mechanisms. (1) When asynchronous publish is used, one OS thread executes the computation logic while another publishes the data. (2) With synchronous publish, both tasks are performed by the same thread.
When the chain’s output is published asynchronously, we assign the corresponding publisher thread to a server \(R_{m^\circ }\) with a time slot immediately following Rm in time. Such a two-server scheduling of the chain enables latency shaping, i.e., it minimizes jitters while preserving \(\overline{L}\). Figure 1b depicts this behavior where we get a constant end-to-end latency \(\approx \overline{L}\). It also shows that we relax the pessimistic LET assumption of one period per task’s response.
With the more popular synchronous publish mechanism, we face a technical challenge when the chain’s last callback has a large execution time variation while it is run by the thread that also publishes the chain’s output. We solve it by architecturally extending the chain by a callback that republishes the chain’s output and running this callback in \(R_{m^\circ }\). We perform this chain extension by exploiting ROS2-supported modular software development, i.e., without any modification or recompilation of the application source code. In fact, we provide a complete tool chain automating each step from profiling to running the chain using reservation servers.
\(R_{m^\circ }\) can also be configured with additional time slots, offering a multi-latency implementation of the chain. Figure 1c illustrates that the chain’s end-to-end latency falls in two distinct time bands when we appropriately configure \(R_{m^\circ }\) with two time slots per cycle. Such an implementation encourages multi-mode application design where, in each mode, the chain has a specific latency. This is particularly useful for autonomous systems running in different environments. Further, we apply particle swarm optimization [26] to place the time slots and thereby try to minimize the average end-to-end latency—a performance measure commonly used in the industry.
Contributions: Our main contributions are as follows:
We introduce the concept of latency shaping to minimize jitters in a ROS2 computation chain with a negligible impact on the observed maximum end-to-end latency. We also demonstrate its versatility in supporting multi-mode chain operation without the need for dynamic schedule reconfiguration, which is crucial for next-generation autonomous systems.
We propose two chain-aware implementations of latency shaping that are compatible with ROS2 semantics.
We develop tools that will automatically (i) determine configurations for reservation servers and (ii) create and configure them as well as assign chain components to run inside them. In essence, we enable design automation for latency shaping.
We apply latency shaping on a real-world benchmark, i.e., for a chain from Lidar to vehicle pose estimation in Autoware’s Autonomous Valet Parking (AVP) [20].
We perform experiments to demonstrate that our proposed concepts can be applied (i) to implement a ROS2 chain following the conventional LET concept and (ii) to directed acyclic graphs (DAGs) comprising ROS2 callbacks.
Paper organization: Section 2 provides the system model and briefly discusses our profiling-based timing model extraction tool. Using a synthetic ROS2 chain, it further compares priority-driven and time-triggered scheduling in terms of maximum end-to-end latency and jitters. In Section 3, we describe our proposed mechanism to manage the end-to-end latency of a ROS2 chain and its different implementations. Section 4 outlines (i) our problem formulation to place time slots for multi-band latency shaping and (ii) a tool chain to automate latency shaping. We present different case studies in Section 5. We discuss certain complex scenarios for future consideration in Section 6. We discuss the related work in Section 7 and provide concluding remarks in Section 8.

2 Background and System Model

2.1 Modeling ROS2 applications as DAGs

ROS2 is a multi-layer middleware [35] that provides easy-to-use application program interfaces (APIs) to develop complex software applications. A major advantage is that it enables independent development of software modules that can be later easily composed together to run coherently even on a distributed platform.
In ROS2, a standalone software module which implements a particular functionality, e.g., object detection, is called a node. The main building blocks of a ROS2 node are event-handling callbacks. There are four types of callbacks, namely timer, subscriber, service, and client callbacks. Timer callbacks are triggered by periodic timer events. Communication between nodes is carried out via topics. When a node subscribes to a topic, new data on the topic triggers the designated subscriber callback to handle it. ROS2 offers a feature called service for blocking remote procedure calls (RPCs). Communications related to services are carried out using request and reply topics. A caller writes input arguments on a request topic to invoke the corresponding service callback. After processing the input, the service callback writes its output on the reply topic, which triggers the client callback.
Consider that a set of η nodes {Ni | 1 ≤ i ≤ η} implements the software applications over ROS2. Each node Ni is further composed of a set Γi of γi callbacks, Γi = {cbi, j | 1 ≤ j ≤ γi}. We can model the applications using a DAG \(\mathbb {G}\) where each callback cbi, j is a vertex in \(\mathbb {G}\). That is, the set of vertices \(\mathbb {V}\) in \(\mathbb {G}\) is given by Γ1 ∪ Γ2 ∪ … ∪ Γη. We draw an edge from cbi, j to \(cb_{i^{\prime },j^{\prime }}\) when cbi, j publishes on a topic T, i.e., \(T \in cb_{i,j}[\mathbb {PT}]\), while \(cb_{i^{\prime },j^{\prime }}\) subscribes to T, i.e., \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\). Further, cbi, j can have multiple outgoing edges when (i) \(T \in cb_{i,j}[\mathbb {PT}]\) has more than one subscriber and/or (ii) cbi, j publishes different parts of its result on multiple topics, i.e., \(|cb_{i,j}[\mathbb {PT}]| \gt 1\). In the same vein, \(cb_{i^{\prime },j^{\prime }}\) can have more than one incoming edge when (i) multiple callbacks publish on \(T \in cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]\) and/or (ii) \(cb_{i^{\prime },j^{\prime }}\) subscribes to two or more topics, i.e., \(|cb_{i^{\prime },j^{\prime }}[\mathbb {ST}]| \gt 1\). For the latter case, we have seen an implementation of synchronizing callbacks using message filters in Autoware [49]. Such a callback \(cb_{i^{\prime },j^{\prime }}\) executes its logic only when each of the subscribed topics has new data. Further, we note that timer callbacks have no incoming edge because they do not subscribe to any topic. The others (i.e., subscriber, service, and client callbacks) can have both incoming and outgoing edges.
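To make this model concrete, the following minimal sketch builds the edge set of \(\mathbb {G}\) from the published and subscribed topic sets of each callback. The dictionary-based representation, callback identifiers, and topic names are illustrative assumptions, not part of our tooling.

```python
# A minimal sketch of building the DAG G from callbacks' published (PT) and
# subscribed (ST) topic sets; all names are placeholders.
from collections import defaultdict

def build_dag(callbacks):
    """callbacks: dict mapping callback id -> {'PT': set of published topics,
                                                'ST': set of subscribed topics}."""
    publishers = defaultdict(set)               # topic -> callbacks publishing on it
    for cb, io in callbacks.items():
        for topic in io['PT']:
            publishers[topic].add(cb)

    edges = set()
    for cb, io in callbacks.items():            # draw an edge when a topic in one
        for topic in io['ST']:                  # callback's PT appears in another's ST
            for src in publishers[topic]:
                edges.add((src, cb, topic))
    return edges

# Example: cb_1,1 (timer) publishes T1; cb_2,1 subscribes to T1 and publishes T2.
dag = build_dag({
    'cb_1,1': {'PT': {'T1'}, 'ST': set()},
    'cb_2,1': {'PT': {'T2'}, 'ST': {'T1'}},
})
print(dag)   # {('cb_1,1', 'cb_2,1', 'T1')}
```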

2.2 Scheduler and thread models

We study the single-threaded executor in ROS2, which is mostly studied in the literature [13, 15, 44] and is also used in our benchmark application from Autoware. Such an executor runs all callbacks in a node Ni based on a certain scheduling policy using only one thread \(N_i[\mathcal {T}_{ex}]\). While it is possible to customize the scheduling inside executors [3, 15], our experiments only consider the default policy which is described in [13, 36]. This paper does not evaluate how different executor implementations influence application timings. However, we believe that our techniques can be trivially applied to all variations of single-threaded executors.
Further, the threads are scheduled by the OS. Here, we have studied two scheduling policies. (i) \(\mathbb {S}_1\): a fixed-priority preemptive scheduling provided by SCHED_FIFO in (Preempt_RT-patched) Linux. When using \(\mathbb {S}_1\), we assign a priority \(\mathcal {T}[Pr]\) to a thread \(\mathcal {T}\), e.g., \(N_i[\mathcal {T}_{ex}][Pr]\) gives the priority of the executor thread \(N_i[\mathcal {T}_{ex}]\) of node Ni. Here, the larger the value of \(\mathcal {T}[Pr]\), the higher the priority of \(\mathcal {T}\). (ii) \(\mathbb {S}_2\): a table-driven reservation-based scheduling provided by LITMUSRT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems, a Linux kernel extension) [10]. When using \(\mathbb {S}_2\), we allocate a thread \(\mathcal {T}\) to a reservation server \(\mathcal {T}[RS]\), e.g., \(N_i[\mathcal {T}_{ex}][RS]\) denotes the server that runs \(N_i[\mathcal {T}_{ex}]\). Details about reservation servers are provided later in this section.
Overall, we see a hierarchical scheduling of ROS2 callbacks. That is, the OS scheduler selects an executor thread of a particular node to run, and the executor then runs one of the node’s ready callbacks. Besides the executor thread, a set of threads, \(N_i[\mathbb {T}_{DDS}]\), is also spawned by the DDS layer for a node Ni to manage both intra- and inter-node communications [16, 19, 41]. In this paper, we study two mechanisms to publish data on a topic [17]. (i) Asynchronous publish: When a callback cbi, j writes its output on the topic \(T \in cb_{i,j}[\mathbb {PT}]\) using this mechanism, a publisher thread \(N_i[\mathcal {T}_{pub}] \in N_i[\mathbb {T}_{DDS}]\) is woken up, which then publishes the data to the subscribers. (ii) Synchronous publish: Using this mechanism, a callback cbi, j directly publishes its output on \(T \in cb_{i,j}[\mathbb {PT}]\), i.e., the executor thread \(N_i[\mathcal {T}_{ex}]\) runs the logic in cbi, j as well as the task of publishing its output to the subscribers.

2.2.1 Table-driven reservation-based scheduling.

In this scheduling environment, threads run inside reservation servers. A server Rm is defined using a set of time slots, \(R_m[Sl] = \lbrace sl_{m,1}, sl_{m,2}, \ldots , sl_{m,\nu _m}\rbrace\), and a cycle time, Rm[CT], that are statically configured. Each slot slm, n has a start time stm, n and an end time etm, n and it repeats every Rm[CT] time units. A thread \(\mathcal {T}\) assigned to \(\mathcal {T}[RS] = R_m\) can only run inside Rm’s time slots. Besides enabling timing determinism, a server constrains the amount of time for which its threads can run on the processor, thereby providing temporal isolation between different sets of threads.
We use LITMUSRT to perform our experiments. It provides real-time schedulers and synchronization mechanisms. Its plugin P-RES implements partitioned reservation-based scheduling and lets us define and use table-driven reservation servers. It allows us to (i) assign a server Rm to a processor core Rm[PC] and (ii) map a set of threads \(R_m[\mathbb {T}]\) to it. If multiple threads are ready to run in a server, processor time is allocated to them in a round-robin manner. We can also assign a priority Rm[Pr] to a server Rm. In the LITMUSRT implementation, a smaller value of Rm[Pr] implies a higher priority. Priorities are useful for isolating high-critical applications from best-effort applications. That is, alongside a server Rm running time-sensitive threads, we can define another server \(R_{m^{\prime }}\) with overlapping time slots and \(R_{m^{\prime }}[Pr] \gt R_{m}[Pr]\) to run best-effort threads. In this case, \(R_{m^{\prime }}\) can run its threads only when there are no ready-to-run or running threads in Rm.
This paper promotes table-driven reservation-based scheduling in LITMUSRT because it offers both timing determinism and good processor utilization for mixed-criticality workloads.
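The following sketch captures these reservation semantics as a plain data structure—statically configured slots that repeat every cycle, a priority (smaller value means higher priority, as in P-RES), a processor core, and the threads mapped to the server. It is an illustration of the model, not the LITMUSRT interface, and all names and values are assumptions.

```python
# A plain-data sketch of a table-driven reservation server R_m; illustrative only.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReservationServer:
    slots: List[Tuple[float, float]]      # R_m[Sl]: [(st, et), ...] in ms within one cycle
    cycle_time: float                     # R_m[CT] in ms
    priority: int                         # R_m[Pr]
    core: int                             # R_m[PC]
    threads: List[int] = field(default_factory=list)   # R_m[T]: mapped thread IDs

    def is_active(self, t: float) -> bool:
        """True iff time t (in ms) falls inside one of the server's time slots."""
        phase = t % self.cycle_time
        return any(st <= phase < et for st, et in self.slots)

# Example: a server with one 16 ms slot repeating every 100 ms.
R1 = ReservationServer(slots=[(0.0, 16.0)], cycle_time=100.0, priority=1, core=1)
assert R1.is_active(112.0) and not R1.is_active(50.0)
```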

2.3 Modeling ROS2 computation chains

While it is clear that ROS2 applications can form complex DAGs, we note that this work does not promise to efficiently schedule a whole DAG. Instead, it focuses on scheduling a jitter-sensitive critical chain of ROS2 callbacks. We define a computation chain chk as an ordered sequence (chk[1], ⋅⋅⋅, chk[l], chk[l + 1], ⋅⋅⋅, chk[μk]) of μk callbacks—\(ch_k[l] \in \mathbb {V}\) is a callback—connected via topics. That is, there is a directed edge from chk[l] to chk[l + 1] in \(\mathbb {G}\), i.e., \(ch_k[l][\mathbb {PT}] \cap ch_k[l+1][\mathbb {ST}] = \lbrace T_k[l]\rbrace \ne \emptyset\). The first callback chk[1] in a chain is typically a timer callback that performs sensor data acquisition. The output of the chain is published on a topic \(T_k[\mu _k] \in ch_k[\mu _k][\mathbb {PT}]\) from which the actuator reads the data.
We are interested in minimizing the jitters Jk in the end-to-end latency Lk of a critical chain chk. For a particular chained execution of callbacks in chk, Lk is the time between the start of chk[1] and the instant when chk[μk] publishes Tk[μk]. We denote the maximum and the minimum observed end-to-end latency of chk by \(\overline{L}_k\) and \(\underline{L}_k\), respectively. We define the jitters Jk as \(\overline{L}_k - \underline{L}_k\) following [12].
For the purpose of scheduling the chain, we are further concerned about the execution and response times of its callbacks. We define the execution time ek[l] as the time required to run by an instance of the callback chk[l], while we denote its maximum observed value by \(\overline{e}_k[l]\). The response time rk[l] of chk[l] is the time between its start and end for a single run and its maximum observed value is denoted by \(\overline{r}_k[l]\).
We understand that a critical chain chk is part of the DAG, because of which it experiences certain unavoidable interference, and our methodology does not exclude such possibilities. (i) The node of a callback chk[l], chk[l][Nd] = Ni, may comprise other callbacks that are not part of chk, i.e., \(\Gamma _i \setminus ch_k \ne \emptyset\). Considering that \(ch_k[l][Nd][\mathcal {T}_{ex}]\) also runs these callbacks besides chk[l], they can delay the execution of chk[l] and increase Lk. (ii) If chk[l] publishes on a topic \(T \in ch_k[l][\mathbb {PT}]\) which is not subscribed to by chk[l + 1], the time spent in computing and publishing the data on T adds to ek[l], thereby increasing Lk. (iii) If chk[l] is a synchronizing callback, it might not run even after receiving the data on Tk[l − 1] published by chk[l − 1] and instead wait for data on the other synchronized topics in \(ch_k[l][\mathbb {ST}] \setminus \lbrace T_k[l-1]\rbrace\). This again increases Lk.
We emphasize that to manage the timings of a chain chk, we only (i) choose an OS scheduling policy and (ii) configure the scheduling parameters of the executor threads in \(ch_k[\mathbb {T}_{ex}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}]| 1\le l \le \mu _k\rbrace\) and the DDS threads in \(ch_k[\mathbb {T}_{DDS}] = \cup _{l=1}^{\mu _k} ch_k[l][Nd][\mathbb {T}_{DDS}]\). This is consistent with the timing engineering process generally followed in the industry, during which changes to the application code and the software platform (including middleware libraries) are not easily possible.
Figure 2: Traversing the events related to each callback in an execution of a chain to measure the end-to-end latency.

2.4 Profiling-based timing model extraction

As highlighted in the survey results in [2], measurement-based timing analysis is more popularly practised in the industry. Hence, as a first work on jitter-control of ROS2 computation chains, we focus on profiling their execution in the target environment covering different scenarios. In the process, we obtain (i) \(\overline{e}_k[l]\) and \(\overline{r}_k[l]\) for a callback chk[l] in chk and (ii) \(\overline{L}_k\) and Lk for chk. Although this paper shows how our methods use measurement data, they can also be applied using analysis results obtained, e.g., by applying a combination of WCET, worst-case response time (WCRT), and end-to-end analysis techniques. However, to the best of our knowledge, such a full-fledged analytical framework is not available for industry use. WCET analysis tools either do not scale to real-world workloads, e.g., related to autonomous functionalities, or provide very pessimistic results. Also, table-driven reservation-based scheduling of ROS2 applications and the related WCRT and end-to-end latency analyses have not been studied so far.
A few works in recent years discuss the measurement of the end-to-end latency of ROS2 computation chains. (i) [28] and [45] directly instrument the application code and record timestamps. (ii) Autoware_Perf [30] and CARET [29] extend ros2_tracing [6] to add trace points in ROS2 and measure the end-to-end latency from collected traces. We follow the second approach and use [1]—a ROS2 tracing framework based on extended Berkeley Packet Filter (eBPF). It already offers trace processing tools to (i) construct DAG representations of ROS2 applications and (ii) measure execution and response times of callbacks. Further, we extend it with our end-to-end latency measurement tool which works as described below.
For each execution of a chain chk, our tool traverses the events related to each callback chk[l] in chk, as illustrated in Figure 2. For a callback chk[l] except the first one, we find the following events in order: callback_start, data_read, data_write, and callback_end. We do not get any data_read event for chk[1] because we assume that it is a timer callback that acquires the sensor data directly via device drivers. We move from chk[l] to chk[l + 1] in chk by matching the topic name and the source timestamp of the data in the data_write event of chk[l] with the ones in the data_read event of chk[l + 1]. Here, the source timestamp is used to uniquely identify a data item across the publisher and subscriber callbacks. We save the start time of chk[1] and after traversing the chain, we find the time when chk[μk] publishes the chain’s output on Tk[μk]. We can measure Lk as the difference between the two time instants. We shall note that the measured values of Lk already subsume the interference experienced by the chain, which is key to understanding our proposed mechanism.
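As an illustration of this traversal, the sketch below computes Lk for one chain execution from an already time-ordered list of trace events. The dictionary-based event representation and its field names are assumptions made for the example; our actual tool operates on the eBPF traces produced by [1].

```python
# A sketch of the end-to-end latency measurement: follow the chain through the
# trace by matching topic names and source timestamps between writes and reads.
def measure_latency(events, chain):
    """events: time-ordered trace dicts with keys 'type', 'callback', 'topic',
    'src_stamp', 'time'; chain: ordered callback ids (ch_k[1], ..., ch_k[mu_k])."""
    # Start of this chain instance: callback_start of the timer callback ch_k[1].
    i = next(idx for idx, e in enumerate(events)
             if e['type'] == 'callback_start' and e['callback'] == chain[0])
    t_start = events[i]['time']

    # The data_write of ch_k[l] is matched with the data_read of ch_k[l+1]
    # via the topic name and the source timestamp of the data item.
    for cb, nxt in zip(chain, chain[1:]):
        write = next(e for e in events[i:]
                     if e['type'] == 'data_write' and e['callback'] == cb)
        i = next(idx for idx, e in enumerate(events)
                 if e['type'] == 'data_read' and e['callback'] == nxt
                 and e['topic'] == write['topic']
                 and e['src_stamp'] == write['src_stamp'])

    # End of this instance: ch_k[mu_k] writes the chain's output on T_k[mu_k].
    t_end = next(e['time'] for e in events[i:]
                 if e['type'] == 'data_write' and e['callback'] == chain[-1])
    return t_end - t_start        # L_k for one execution of ch_k
```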

2.5 Priority- vs time-driven chain execution

This section demonstrates the timing behavior of a high-critical ROS2 computation chain ch1 when implemented using two standard real-time scheduling policies, i.e., fixed-priority and time-triggered scheduling. Here, ch1 spans five ROS2 nodes {N1, N2, N3, N4, N5} where each node Ni comprises one callback cbi, 1. Further, ch1 starts with a timer callback ch1[1] = cb1, 1 in N1 that runs every 100 ms and publishes data on a topic T1[1]. Each of the other nodes runs a subscriber callback ch1[l] = cbl, 1 that reads data from topic T1[l − 1], runs computation load, and then publishes data on T1[l]. The chain’s output is, hence, published on T1[5]. For each callback ch1[l], the execution time e1[l] varies uniformly between 5 ms and 15 ms. We run the executor threads \(ch_1[\mathbb {T}_{ex}]\) on one processor core. Additionally, we create a node N6 and run \(N_6[\mathcal {T}_{ex}]\) on the same core. It runs a timer callback cb6, 1 every 100 ms that interferes with ch1. We vary the execution time of cb6, 1 between int and \(\frac{int}{2}\) where int ∈ {10, 20, 30, 40} ms. We run the DDS threads \(ch_1[\mathbb {T}_{DDS}] \cup N_6[\mathbb {T}_{DDS}]\) on another processor core. We use two schedule configurations as follows:
(1)
Using \(\mathbb {S}_1\), we assign (i) a high priority to each executor thread in \(ch_1[\mathbb {T}_{ex}]\) and (ii) a low priority to \(N_6[\mathcal {T}_{ex}]\).
(2)
Using \(\mathbb {S}_2\), we create five high-priority servers R1, R2, R3, R4, and R5 with the time slots (in ms) sl1, 1 = [0, 16), sl2, 1 = [16, 32), sl3, 1 = [32, 48), sl4, 1 = [48, 64), and sl5, 1 = [64, 80), respectively. Further, we create a low-priority server R6 with the time slot (in ms) sl6, 1 = [0, 100). For 1 ≤ i ≤ 6, we assign Ri[CT] = 100 ms and \(R_i[\mathbb {T}]=\lbrace N_i[\mathcal {T}_{ex}]\rbrace\).
In both configurations, (1) and (2), we run the DDS threads using \(\mathbb {S}_1\) while assigning a high priority to \(\mathcal {T} \in ch_1[\mathbb {T}_{DDS}]\) and a low priority to \(\mathcal {T} \in N_6[\mathbb {T}_{DDS}]\).
Table 1: Measured maximum end-to-end latency (lat.) and jitters (jit.) in ms.

Interfering load per 100 ms     | 0    | 10   | 20   | 30   | 40
Priority-driven scheduling lat. | 57.3 | 58.1 | 58.7 | 59.8 | 62.8
Priority-driven scheduling jit. | 32.4 | 33.4 | 34.2 | 35.4 | 36.2
Time-driven scheduling lat.     | ≈ 80 for all loads
Time-driven scheduling jit.     | ≈ 10 for all loads
We measure L1, \(\overline{L}_1\), and J1. Table 1 shows the variation in \(\overline{L}_{1}\) and J1 with int using (1) and (2). We have the following observations:
Using (1), we obtain lower values of \(\overline{L}_1\). However, we get higher values of J1 which is mainly contributed by the varying execution times—common in real-world applications—of the chain’s callbacks. Hence, using priority-driven scheduling, it is challenging to control jitters in a chain when it has time-varying workloads.
Compared to (1), \(\overline{L}_1\) increases using (2). This is due to the over-provisioned time slots {sli, 1|1 ≤ i ≤ 5} that are tuned based on the maximum observed response times \(\lbrace \overline{r}_1[i]|1\le i \le 5\rbrace\) of the callbacks in ch1. That is, until sli, 1 comes, ch1[i] cannot start even when ch1[i − 1] has finished execution. Nevertheless, due to such constrained progress in the chain’s execution, J1 reduces to reflect only the response time variation of the chain’s last callback. Hence, using time-triggered scheduling as in (2), jitters reduce at the expense of an increased maximum end-to-end latency.
Using (1), we see a slight increase in \(\overline{L}_{1}\) with increasing int. This is because \(\mathbb {S}_1\) is a practical implementation of fixed-priority preemptive scheduling with non-negligible preemption costs. Such costs are also present in (2) but are not reflected in \(\overline{L}_{1}\) due to the over-provisioned slots (1 ms longer) for running the callbacks in ch1.
Concerning the above observations, our goal is to get the best of priority-driven and time-triggered scheduling schemes. We aim to eliminate jitters in the chain while its maximum end-to-end latency remains comparable to what we get from its high-priority execution.

3 End-to-End Latency Management

This section shows how our proposed mechanism uses table-driven reservation-based scheduling to manage the end-to-end latency of a jitter-sensitive chain while considering ROS2 semantics. Table-driven reservations are also known to be effective in implementing certifiable safety-critical systems [46].

3.1 Controlling maximum end-to-end latency

Let us implement a critical chain chk similar to (1) in Section 2.5. We run \(\mathcal {T} \in ch_k[\mathbb {T}_{ex}]\) and \(\mathcal {T} \in ch_k[\mathbb {T}_{DDS}]\) using \(\mathbb {S}_1\) and a high priority. This follows the industry practice of prioritizing critical workloads to minimize the interference experienced by them. We profile such a high-priority execution of chk and measure its maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_1)\) and jitters \(J_k(\mathbb {S}_1)\). We note that \(\overline{L}_k(\mathbb {S}_1)\) subsumes the unavoidable interference experienced by chk as explained in Section 2.3. We do not consider co-scheduling multiple critical chains.
Following our goal to keep the maximum end-to-end latency close to \(\overline{L}_k(\mathbb {S}_1)\), we define a server Rm. Without loss of generality, we configure Rm with a time slot slm, 1 where:
\begin{equation}sl_{m,1} = [0, \overline{L}_k(\mathbb {S}_1) + \epsilon ]. \end{equation}
(1)
We can also shift slm, 1 by an offset while co-optimizing the timing performance of multiple chains, however, it is not the focus of this paper.
The cycle time Rm[CT] of Rm is equal to the period of the chain execution. We consider that \(\overline{L}_k(\mathbb {S}_1) \lt R_m[CT]\), i.e., one execution of chk can be completed within a cycle. Further, we assume that the maximum processor utilization contributed by the threads in \(ch_k[\mathbb {T}_{ex}]\) is less than 100%. That is, one CPU core can run all threads in \(ch_k[\mathbb {T}_{ex}]\) without any overload. In this paper, we study this simple case. However, when \(\overline{L}_k(\mathbb {S}_1) \gt R_m[CT]\), our method can be extended to cover two or more processor cores and define servers on each core for a pipelined execution of the chain. The placement of time slots in that case needs to consider profiles of sub-chains, which we leave as future work.
We assign the threads running the callbacks in chk to Rm, i.e., \(R_m[\mathbb {T}] = ch_k[\mathbb {T}_{ex}]\). At the beginning of slm, 1 in a cycle, chk[1] runs. Thereafter, chk[l] can start whenever it has received the data from chk[l − 1]—similar to using \(\mathbb {S}_1\) and a high priority. Hence, chk will finish execution as soon as possible and the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) using this technique will be approximately equal to \(\overline{L}_k(\mathbb {S}_1)\). We assign a high priority to Rm, e.g., Rm[Pr] = 1 using LITMUSRT. To improve the utilization of the processor core when chk does not run until \(\overline{L}_k(\mathbb {S}_2)\), we can define another server \(R_{m^{\prime }}\) with \(R_{m^{\prime }}[Pr] \gt R_m[Pr]\) and map threads performing best-effort tasks to it. In that case, Rm basically isolates chk from the interference by the best-effort tasks and we still get \(\overline{L}_k(\mathbb {S}_2) \approx \overline{L}_k(\mathbb {S}_1)\).
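A minimal sketch of this dimensioning step is given below; it only encodes Equation 1 and the feasibility check described above, and the profiled latency, period, and ϵ are illustrative numbers.

```python
# A sketch of dimensioning R_m per Equation 1; all values are placeholders.
def dimension_main_server(L_max_S1, period_ms, epsilon_ms=1.0):
    # One chain execution must fit in a cycle (L_max(S1) < R_m[CT]).
    assert L_max_S1 + epsilon_ms <= period_ms, "chain does not fit in one cycle"
    return {'slots': [(0.0, L_max_S1 + epsilon_ms)],   # sl_{m,1} = [0, L_max(S1) + eps]
            'cycle_time': period_ms,                   # R_m[CT] = chain period
            'priority': 1}                             # high priority in LITMUS^RT terms

R_m = dimension_main_server(L_max_S1=59.0, period_ms=100.0)
```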
Experimentally, we have also studied the DDS communication threads in a ROS2 node. We have observed that a few of them—depending on the DDS implementation we use—influence the end-to-end latency by delaying the communication between the chain’s callbacks. Hence, we must isolate these threads from interference by best-effort workloads. Further, we have observed that while they run during data send and receive, they also run at other instants performing protocol-related tasks (e.g., sending heartbeat signals and polling message queues). Hence, we cannot assign them to servers with specific time slots without delaying an important DDS-related task significantly (e.g., by tens of milliseconds). Also, they typically run for a short duration (several hundred microseconds) when they wake up. They mostly run in parallel to the ROS2 threads executing the callbacks. Considering the above observations related to the DDS threads, we bind the threads in \(ch_k[\mathbb {T}_{DDS}]\) to a different processor core and use \(\mathbb {S}_1\) to schedule them with a high priority. This ensures that the DDS communication in chk will have a minimal latency.
Figure 3:
Figure 3: Chain-aware single- and multi-band latency shaping.

3.2 Controlling jitters

When we schedule chk using a high-priority server Rm as explained above, we can minimize the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\). Now, to achieve the goal of reducing the jitters, we study two mechanisms for publishing the chain’s output, namely, asynchronous and synchronous publish, as described in Section 2.2.

3.2.1 Asynchronous publish.

When the last callback chk[μk] in chk writes its output using this mechanism, the publisher thread \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) publishes the data to its subscribers. We propose to execute \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) in a server \(R_{m^\circ }\), as illustrated in Figure 3 Design-1. We configure \(R_{m^\circ }\) with a time slot \(sl_{m^\circ ,1}\) given by:
\begin{equation}sl_{m^\circ ,1} = \Big [et_{m,1}, et_{m,1} + \overline{e}_k[pub]\Big ]. \end{equation}
(2)
Here, etm, 1 marks the end of slm, 1 in Rm (see Equation 1) and \(\overline{e}_k[pub]\) is the maximum time required to publish chk’s output. \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) can be woken up at any time within slm, 1 but it cannot run until it gets the time slot \(sl_{m^\circ ,1}\). Hence, chk’s output is available to its subscribers only during \(sl_{m^\circ ,1}\). That is, the end-to-end latency \(L_k(\mathbb {S}_2)\) of chk varies between \(\overline{L}_k(\mathbb {S}_1) + \epsilon\) and \(\overline{L}_k(\mathbb {S}_1) + \epsilon + \overline{e}_k[pub]\). We have observed that \(\overline{e}_k[pub] \lt 1\:\text{ms}\), which results in \(J_k(\mathbb {S}_2) \lt 1\:\)ms. We can further configure \(R_{m^\circ }\) such that:
\begin{equation}R_{m^\circ }[Pr] = R_{m}[Pr]; R_{m^\circ }[PC] = R_{m}[PC]; R_{m^\circ }[CT] = R_{m}[CT]. \end{equation}
(3)
We term the above idea as latency shaping. Typically, a task is either event- or time-triggered. However, latency shaping requires a time- and event-triggered task to publish the chain’s output. That is, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) is triggered first when chk[μk] invokes an appropriate DDS API to publish data, and later, at the time instant when \(sl_{m^\circ ,1}\) starts, it is dispatched. Here, \(ch_k[\mu _k][Nd][\mathcal {T}_{pub}]\) cannot run until both conditions are fulfilled.
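The slot arithmetic of Equations 1–3 can be summarized in a short sketch. The dictionary representation of a server and the numbers (e.g., a 60 ms main slot and a 1 ms publish budget) are illustrative assumptions.

```python
# A sketch of Equations 2 and 3 for the asynchronous-publish design: R_m_circ gets
# one slot of length e_pub that starts exactly where R_m's slot ends and shares
# R_m's priority, core, and cycle time.
def dimension_publisher_server(R_m, e_pub_max_ms):
    et_m1 = R_m['slots'][0][1]                              # end of sl_{m,1}
    return {'slots': [(et_m1, et_m1 + e_pub_max_ms)],       # sl_{m_circ,1} (Eq. 2)
            'cycle_time': R_m['cycle_time'],                # R_m_circ[CT] = R_m[CT] (Eq. 3)
            'priority': R_m['priority'],                    # R_m_circ[Pr] = R_m[Pr]
            'core': R_m['core']}                            # R_m_circ[PC] = R_m[PC]

R_m = {'slots': [(0.0, 60.0)], 'cycle_time': 100.0, 'priority': 1, 'core': 1}
R_m_circ = dimension_publisher_server(R_m, e_pub_max_ms=1.0)  # observed e_pub < 1 ms
```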

3.2.2 Synchronous publish.

Using this mechanism, chk[μk] directly publishes chk’s output, i.e., \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) runs chk[μk] as well as publishes the data on Tk[μk]. We study only static allocation of threads to servers. Now, if we keep \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) in Rm, we end up with the same jitters \(J_k(\mathbb {S}_1)\) as with the high-priority execution of chk. Alternatively, we can use two servers, Rm and \(R_{m^\circ }\), as in Section 3.2.1, where the threads in \(ch_k[\mathbb {T}_{ex}] \setminus \lbrace ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\rbrace\) run in Rm and \(ch_k[\mu _k][Nd][\mathcal {T}_{ex}]\) uses \(R_{m^\circ }\). Here, we can configure slm, 1 based on the measured maximum end-to-end latency \(\overline{L}_{k^-}(\mathbb {S}_1)\)—during a high-priority execution—of the sub-chain \(ch_{k^-}\) comprising {chk[l] | 1 ≤ l ≤ μk − 1}, while \(sl_{m^\circ ,1}\) shall be longer than the maximum response time \(\overline{r}_k[\mu _k]\) of chk[μk]. Further, slm, 1 is immediately followed by \(sl_{m^\circ ,1}\) in time. However, in this design, the execution time variation in chk[μk] still contributes to the jitters in chk, which may not be acceptable.
Hence, we propose to add a latency-shaping subscriber callback chk[μk + 1] inside a ROS2 node chk[μk + 1][Nd], as shown in Figure 3 Design-2. We change the name of the topic Tk[μk] on which chk[μk] publishes. Here, we do not modify or recompile the application code, which is a crucial consideration as the application sources are often not available to a systems engineer in the industry. In fact, we can just remap the topic name of Tk[μk] in the launch file from, e.g., Tk[μk][Nm] to \(T_k[\mu _k][Nm]\_tmp\). We implement chk[μk + 1] to subscribe to and read data from Tk[μk] (i.e., the subscribed topic name must be \(T_k[\mu _k][Nm]\_tmp\)) and publish the same data to Tk[μk + 1], i.e., it republishes chk’s output. We further ensure that Tk[μk + 1][Nm] = Tk[μk][Nm], i.e., it has the original name of Tk[μk]. That is, the subscribers of chk’s output will now read data from Tk[μk + 1]. This essentially extends the chain chk to (chk[1], …, chk[μk], chk[μk + 1]) considering that the end of its execution is identified when its output is available.
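A minimal Python (rclpy) sketch of such a latency-shaping republisher node is shown below. The message type (the chain’s output is assumed to be a pose here) and the generic topic names 'in' and 'out'—which are remapped at launch time to \(T_k[\mu _k][Nm]\_tmp\) and Tk[μk][Nm], respectively—are assumptions for illustration; our template node need not be written in Python.

```python
# republisher.py -- sketch of the latency-shaping callback ch_k[mu_k + 1].
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseWithCovarianceStamped  # assumed chain output type


class LatencyShapingRepublisher(Node):
    def __init__(self):
        super().__init__('latency_shaping_republisher')
        # Publish under the chain's original output topic name (after remapping).
        self.pub = self.create_publisher(PoseWithCovarianceStamped, 'out', 10)
        # Subscribe to the renamed (temporary) output topic of ch_k[mu_k].
        self.sub = self.create_subscription(
            PoseWithCovarianceStamped, 'in', self.republish, 10)

    def republish(self, msg):
        # No computation: forward the data unchanged so that r_k[mu_k + 1] stays short.
        self.pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(LatencyShapingRepublisher())


if __name__ == '__main__':
    main()
```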
Now, we run this extended chain using two servers, Rm and \(R_{m^\circ }\), that execute \(R_m[\mathbb {T}] = \lbrace ch_k[l][Nd][\mathcal {T}_{ex}]|1 \le l \le \mu _k\rbrace\) and \(R_{m^\circ }[\mathbb {T}] = \lbrace ch_k[\mu _k+1][Nd][\mathcal {T}_{ex}]\rbrace\), respectively. While Rm shall comprise a slot slm, 1 as per Equation 1, \(R_{m^\circ }\) shall comprise a slot given by:
\begin{equation}sl_{m^\circ ,1} = \Big [et_{m,1},et_{m,1} + \overline{r}_k[\mu _k+1]\Big ], \end{equation}
(4)
where, \(\overline{r}_k[\mu _k+1]\) is the maximum observed response time of chk[μk + 1] to execute the logic for republishing chk[μk]’s output. Here, chk[μk + 1] only republishes a data item and does not perform any computation, thus, \(\overline{r}_k[\mu _k+1]\) is short and impacts the maximum end-to-end latency \(\overline{L}_k(\mathbb {S}_2)\) negligibly. Also, the time to republish should be fairly constant and, hence, the jitters \(J_k(\mathbb {S}_2)\) are negligible.
Additionally, we note that \(ch_k[\mu _k+1][Nd][\mathbb {T}_{DDS}]\) shall be scheduled using \(\mathbb {S}_1\) and a high priority to avoid communication delays between chk[μk] and chk[μk + 1].

3.3 Multiple latency bands

So far, we have obtained one short band in which the end-to-end latency lies for each chain execution. We can also produce more such bands by configuring \(R_{m^\circ }\) with multiple non-overlapping time slots, as shown in Figure 3 Design-3. The length \(\Delta _{m^\circ }\) of each time slot is calculated from Equation 2 or 4 based on how the chain’s output is published. For example, \(\Delta _{m^\circ } = \overline{r}_k[\mu _k+1]\) if chk[μk] publishes synchronously. We place the time slots \((sl_{m^\circ ,1}, sl_{m^\circ ,2}, \ldots , sl_{m^\circ ,\nu _{m^\circ }})\) in chronological order, where \(sl_{m^\circ ,\alpha } = [x_{m^\circ ,\alpha } - \Delta _{m^\circ }, x_{m^\circ ,\alpha })\). We require \(x_{m^\circ ,1} - \Delta _{m^\circ } \gt \underline{L}_k(\mathbb {S}_1)\), where \(\underline{L}_k(\mathbb {S}_1)\) is the observed minimum end-to-end latency of chk during its high-priority execution. Otherwise, chk may never publish its output in \(sl_{m^\circ ,1}\). We place \(sl_{m^\circ ,\nu _{m^\circ }}\) in the same position as in our single-band solution (i.e., following Equation 2 or 4). Hence, we do not change the maximum end-to-end latency of chk.
Such a configuration of \(R_{m^\circ }\) supports a multi-mode implementation of a chain chk where, in each mode, it uses a specific slot in the cycle to publish its output. That is, we get negligible jitters in the end-to-end latency in each operating mode of chk. An autonomous system, e.g., a car, operates in different environments or modes, producing largely varying timing behavior of critical computation chains. For example, in heavy traffic, the end-to-end latency from Lidar/camera to steering and speed control might be long because of long computation times in object detection and tracking. System-level performance can improve if specific latency-aware control logic is designed per mode [37, 38], and our design supports that.
We foresee that mode switch conditions are statically defined. We consider that the active mode can be read from a topic by chk[μk + 1][Nd]. It can use this knowledge to ensure that it publishes in the appropriate slot to maintain the end-to-end latency of the mode. Here, we need to put additional logic in chk[μk + 1][Nd]. A prospective implementation of chk[μk + 1][Nd] is briefly discussed in Section 6. The important point is that, in our design, we do not need to change the server configurations online during a mode switch. We note that co-designing a multi-mode application logic and its timing design following multi-band latency shaping is future work.

4 Design Automation

Figure 4: Lidar-enabled localization in Autonomous Valet Parking using recorded data replayed by rosbag.

4.1 Placement of time slots

In single-band latency shaping (see Section 3.2), the average end-to-end latency \(\tilde{L}_k(\mathbb {S}_2)\) of a chain chk is close to the maximum value \(\overline{L}_k(\mathbb {S}_2)\). It may be significantly higher than what we can achieve during the priority-driven execution of chk. Besides supporting multi-mode application design, our multi-band latency shaping can also improve the average end-to-end latency compared to a single-band implementation. Consider that the design specification allows \(\nu _{m^\circ }\) latency bands for chk. We formulate a mathematical problem to place \(\nu _{m^\circ }\) time slots in \(R_{m^\circ }\). In the process, our goal is to minimize the average end-to-end latency \(\tilde{L}_k(\mathbb {S}_2,\nu _{m^\circ })\) of chk using \(\nu _{m^\circ }\) time slots. We choose to minimize \(\tilde{L}_k(\mathbb {S}_2,\nu _{m^\circ })\) because it is often used as a performance measure by systems engineers in the industry.
Towards our goal, we profile chk during its high-priority execution. We consider a cumulative probability distribution function ϕ(·) where ϕ(x) gives the measured probability that \(L_k(\mathbb {S}_1) \le x\). We represent each latency band by just one value corresponding to the end of the time slot (e.g., \(x_{m^\circ ,\alpha }\) for \(sl_{m^\circ ,\alpha }\)). The probability that \(x_{m^\circ ,\alpha -1} \lt L_k(\mathbb {S}_1) \le x_{m^\circ ,\alpha }\) is \(\Phi (\alpha -1,\alpha) = \phi (x_{m^\circ ,\alpha }) - \phi (x_{m^\circ ,\alpha -1})\), which is also approximately equal to the probability that the chk will publish its output in \(sl_{m^\circ ,\alpha }\). Hence, we derive an approximate expression to compute \(\tilde{L}_k(\mathbb {S}_2,\nu _{m^\circ })\) as follows:
\begin{equation}\tilde{L}_k(\mathbb {S}_2,\nu _{m^\circ }) = \phi (x_{m^\circ ,1}) \cdot x_{m^\circ ,1} + \sum _{\alpha = 2}^{\nu _{m^\circ }} \Phi (\alpha -1,\alpha) \cdot x_{m^\circ ,\alpha }. \end{equation}
(5)
We consider constraints on placing \(sl_{m^\circ ,1}\) (or \(x_{m^\circ ,1}\)) and \(sl_{m^\circ ,\nu _{m^\circ }}\) (or \(x_{m^\circ ,\nu _{m^\circ }}\)), as explained in Section 3.3. We consider that two consecutive slots are non-overlapping, i.e., \(x_{m^\circ ,\alpha - 1} \lt x_{m^\circ ,\alpha } - \Delta _{m^\circ }\).
Further, in ROS2, a liveliness quality-of-service (QoS) [11] can be defined for a topic. When a data item on such a topic becomes older than the specified duration LV, it is not processed any further. In that case, we can add a constraint that the distance between two adjacent bands must not exceed the QoS requirement, i.e., \(x_{m^\circ ,\alpha } - x_{m^\circ ,\alpha - 1} \le LV\).
We use particle swarm optimization (PSO) [26] to solve the above constrained minimization problem and obtain the position of latency bands. We use PYSWARMS [32], a Python library for PSO.
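The following sketch shows how the slot-placement problem can be set up with PYSWARMS. The synthetic latency samples, the slot length, the number of bands, and the penalty-based handling of the constraints are illustrative assumptions; in our tool, the samples come from profiling the chain's high-priority execution.

```python
# A sketch of minimizing Eq. 5 over the free band positions with PYSWARMS.
import numpy as np
import pyswarms as ps

rng = np.random.default_rng(0)
samples = rng.uniform(35.0, 85.0, 500)   # placeholder for measured L_k(S1) values [ms]
L_min, L_max = samples.min(), samples.max()
delta = 1.0                              # slot length Delta_m_circ [ms]
x_last = L_max + 2.0                     # fixed end of the last band (single-band position)
nu = 3                                   # number of latency bands nu_m_circ

def phi(x):
    """Empirical CDF of the profiled end-to-end latency, evaluated per particle."""
    return np.mean(samples[None, :] <= np.asarray(x)[:, None], axis=1)

def avg_latency(pos):
    """pos: (n_particles, nu-1) free band positions; returns Eq. 5 plus penalties."""
    fixed = np.full((pos.shape[0], 1), x_last)
    xs = np.sort(np.concatenate([pos, fixed], axis=1), axis=1)   # x_1 < ... < x_nu
    cost = phi(xs[:, 0]) * xs[:, 0]
    for a in range(1, nu):
        cost += (phi(xs[:, a]) - phi(xs[:, a - 1])) * xs[:, a]
    # Penalize overlapping slots and a first band that starts before L_min.
    bad = (np.diff(xs, axis=1) <= delta).any(axis=1) | (xs[:, 0] - delta <= L_min)
    return cost + 1e6 * bad

opt = ps.single.GlobalBestPSO(n_particles=100, dimensions=nu - 1,
                              options={'c1': 0.5, 'c2': 0.3, 'w': 0.9},
                              bounds=(np.full(nu - 1, L_min), np.full(nu - 1, x_last)))
best_cost, best_pos = opt.optimize(avg_latency, iters=150)
print(sorted(best_pos), best_cost)       # ends of the nu-1 free bands and the Eq. 5 value
```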
Figure 5: Tool chain for automated latency shaping.
When our multi-band latency shaping is not used in combination with multi-mode application design, we can also adapt our single-band implementation to trade off between \(\tilde{L}_k(\mathbb {S}_2)\) and \(J_k(\mathbb {S}_2)\). The more we shift the start \(st_{m^\circ ,1}\) of \(sl_{m^\circ ,1}\) in \(R_{m^\circ }\) towards \(\underline{L}_k(\mathbb {S}_1)\), the more \(\tilde{L}_k(\mathbb {S}_2)\) reduces while \(J_k(\mathbb {S}_2)\) increases. Hence, our proposed two-server scheduling of a chain enables shaping the end-to-end latency with respect to different timing requirements.

4.2 Tool chain for latency shaping

We develop a tool chain to automate latency shaping for a ROS2 computation chain chk, as depicted in Figure 5. First of all, we offer a Chain Profiler to collect traces and measure the chain’s end-to-end latency (see Section 2.4) when run using SCHED_FIFO and a high priority. Using the measured values provided by the Profiler, Slot Calculator computes the position of the time slots for \(R_{m^\circ }\) offline (see Section 4.1).
For a chain chk that publishes its output synchronously, we need a ROS2 node chk[μk + 1][Nd] with a latency shaping callback chk[μk + 1] extending the chain. For this, we develop a template node with a callback that subscribes to a topic and republishes a consumed data on another topic. We offer a tool, Republisher Editor, to automatically make necessary modifications to extend the chain as follows: (i) It edits the launch file of the original application to remap the topic name of Tk[μk] published by chk[μk], i.e., from Tk[μk][Nm] to \(T_k[\mu _k][Nm]\_tmp\). (ii) It edits the launch file of chk[μk + 1][Nd] and remaps subscribed and published topic names to \(T_k[\mu _k][Nm]\_tmp\) and Tk[μk][Nm], respectively. (iii) It uses a ROS2 API—ros2 topic type—to get the data type Tk[μk][Typ] of the chain’s output. From Tk[μk][Typ], it derives the ROS2 package Tk[μk][pkg] and the header file Tk[μk][Hd] containing the declaration of Tk[μk][Typ]. (iv) It edits CMakeLists.txt and package.xml files in chk[μk + 1][Nd] to link to Tk[μk][pkg]. (v) It edits the source code of chk[μk + 1][Nd] to include Tk[μk][Hd] and declare the type of subscribed and published topics as Tk[μk][Typ]. (vi) It builds chk[μk + 1][Nd] to generate the desired executable.
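For illustration, the remapping performed in steps (i) and (ii) can be expressed in a ROS2 Python launch file as sketched below. The package, executable, and topic names are placeholders; the actual Republisher Editor applies equivalent edits to the application's existing launch files.

```python
# latency_shaping.launch.py -- a sketch of the topic remapping; names are placeholders.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        # (i) Original last node of the chain: rename its output topic to a temporary one.
        Node(package='ndt_localizer_pkg', executable='ndt_localizer_node',
             remappings=[('pose_estimate', 'pose_estimate_tmp')]),
        # (ii) Latency-shaping republisher: read the temporary topic and republish it
        #      under the original name so that downstream subscribers stay untouched.
        Node(package='latency_shaping', executable='republisher',
             remappings=[('in', 'pose_estimate_tmp'), ('out', 'pose_estimate')]),
    ])
```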
Further, we offer a tool, Server Configurator, to create and run two servers, Rm and \(R_{m^\circ }\). The Configurator reads files to obtain (i) the measured maximum end-to-end latency of the chain \(\overline{L}_k(\mathbb {S}_1)\) while creating Rm (see Equation 1) and (ii) the time slots produced by the Calculator while creating \(R_{m^\circ }\) (see Section 3.3). Also, we offer a tool, Thread Allocator, to (i) first identify the threads in the ROS2 nodes over which chk spans, i.e., \(ch_k[\mathbb {T}_{ex}]\) and \(ch_k[\mathbb {T}_{DDS}]\), and (ii) then allocate them to Rm and \(R_{m^\circ }\) as per design. The Allocator uses Linux tools (i.e., ps with appropriate arguments) to identify a ROS2 node for a particular thread ID. It uses a LITMUSRT API to allocate a thread to a server. At the end, the components of the chain run using servers following the latency shaping concept.
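A sketch of the Allocator's thread-identification step is given below; the node name is a placeholder, and the subsequent step of attaching each thread ID to a reservation via the LITMUSRT interface is not reproduced here.

```python
# A sketch of identifying the threads of a ROS2 node via `ps`; illustrative only.
import subprocess


def threads_of(node_name):
    """Return the thread IDs whose command line mentions the given ROS2 node."""
    out = subprocess.run(['ps', '-eLo', 'tid,comm,cmd'],
                         capture_output=True, text=True, check=True).stdout
    tids = []
    for line in out.splitlines()[1:]:          # skip the header line
        tid, comm, cmd = line.strip().split(None, 2)
        if node_name in cmd:
            tids.append(int(tid))
    return tids


print(threads_of('ndt_localizer_node'))       # executor and DDS threads of the node
```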

5 Case Studies

5.1 Autonomous valet parking (AVP)

We study localization in AVP and run its demo from Autoware [5] where the car starts from a predefined position, then drives to a parking spot, and later returns to the initial position. Our workstation has an AMD Ryzen Threadripper Pro 3955WX with 16 CPU cores at 3.7 GHz, 64 GB RAM, and an Nvidia RTX 2080 GPU. We run ROS2 Foxy over LITMUSRT-patched Linux v5.4.1. We have implemented our techniques for two DDS implementations, i.e., Eclipse Cyclone DDS and eProsima’s Fast DDS. Considering that Cyclone DDS only supports the synchronous publish mechanism, which is also used in AVP, we present the results using it in this section.
In this demo, a rosbag [27] plays back raw point cloud data recorded from the car’s front and rear Lidars. Although it does not publish data at regular intervals, data are available at almost 10 Hz. To ensure that the localization runs every 100 ms, we introduce a ROS2 node NTh, Throttler, as shown in Figure 4. It has two callbacks, cbrTh and cbfTh, where cbrTh (cbfTh) subscribes to the data from the rear (front) Lidar. Here, cbrTh (or cbfTh) drops a data item if it arrives before a certain time interval has elapsed since the previous one; otherwise, it republishes the data on a topic TrTh (or TfTh). We put the thread \(N_{Th}[\mathcal {T}_{ex}]\) running these callbacks in a server RTh with a [0, 2] ms time slot repeating every RTh[CT] = 100 ms. Hence, cbrTh (or cbfTh) publishes data every 100 ms and we can consider it to be a timer callback starting a chain following our definition in Section 2.3.
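A minimal Python (rclpy) sketch of one such throttling callback is shown below; the topic names, message type, and the 90 ms minimum interval are assumptions made for the example.

```python
# throttler.py -- sketch of cb_rTh: drop data arriving too soon, else republish.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import PointCloud2


class Throttler(Node):
    def __init__(self):
        super().__init__('throttler')
        self.min_interval = 0.09          # seconds; slightly below the 100 ms period
        self.last_pub = None
        self.pub = self.create_publisher(PointCloud2, 'points_rear_throttled', 1)
        self.sub = self.create_subscription(PointCloud2, 'points_rear', self.on_data, 1)

    def on_data(self, msg):
        now = self.get_clock().now()
        # Forward the data only if enough time has elapsed since the last forwarded item.
        if self.last_pub is None or (now - self.last_pub).nanoseconds * 1e-9 >= self.min_interval:
            self.pub.publish(msg)
            self.last_pub = now


def main():
    rclpy.init()
    rclpy.spin(Throttler())


if __name__ == '__main__':
    main()
```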
TrTh (or TfTh) is subscribed to by a callback cbrFl (or cbfFl) in a Rear Lidar Filter NrFL (or Front Lidar Filter NfFL) node, see Figure 4. Further, cbrFl and cbfFl publish the filtered point cloud data on topics, TrFl and TfFl, respectively. Lidar Fusion node Nfus runs a synchronizing callback cbfus that subscribes to both TrFl and TfFl and publishes the fused point cloud data on Tfus. Voxel Grid node Nvg runs a callback cbvg that subscribes to Tfus and publishes the downsampled point cloud data on Tds. Finally, NDT Localizer node Nloc runs a callback cbloc to estimate the car’s position based on the subscribed data on Tds. Here, cbloc publishes the estimated pose on Tpos. In our experiments, we apply latency shaping to the chain chrl originating with the acquisition of the rear Lidar data, i.e., it is formed by cbrTh, cbrFl, cbfus, cbvg, and cbloc. The output of chrl is published on Tpos.
Figure 6: End-to-end latency variation in the chain under different schedule configurations.
Table 2: Measured WCETs of callbacks in AVP [ms].

cbrTh | cbfTh | cbrFl | cbfFl | cbfus | cbvg | cbloc
0.6   | 0.6   | 19.9  | 30.7  | 3.3   | 9.2  | 50.8
Table 3: Chain’s timing performance with different schedule configurations (all values are in ms).

                      |      End-to-end latency
Scheduler Config.     | Avg. | Max. | Jitters
SCHED_FIFO            | 67.4 | 84.7 | 51.8
1 server per callback | 72.8 | 95.7 | 48.6
1 server per chain    | 64.2 | 84.4 | 51.6
1-band LS             | 86.6 | 86.7 | ≈ 0.2
3-band LS (Random)    | 76.4 | 86.6 | ≈ 0.2 per band
3-band LS (PSO)       | 71.3 | 86.5 | ≈ 0.2 per band

5.1.1 Using SCHED_FIFO or \(\mathbb {S}_1\).

We configure each executor thread \(N_i[\mathcal {T}_{ex}]\), where Ni ∈ {NrFl, NfFl, Nfus, Nvg, Nloc}, to run using \(\mathbb {S}_1\) and a high priority (i.e., 99). We bind \(N_{fFl}[\mathcal {T}_{ex}]\) to run on Core 0 while the others are bound to Core 1, which we maintain for all experiments with AVP. On both Core 0 and Core 1, we also run interfering load using timer callbacks, cbint, 0 and cbint, 1, in nodes Nint, 0 and Nint, 1, respectively. Each runs every 10 ms and has a maximum execution time of 2 ms. We assign a lower priority (i.e., <  99) to the threads \(N_{int,0}[\mathcal {T}_{ex}]\) and \(N_{int,1}[\mathcal {T}_{ex}]\). During the demo run, we measure—using the Chain Profiler—the end-to-end latency of each execution of the chain chrl and the maximum execution time (mWCET) of each callback. We provide the mWCETs over multiple runs in Table 2. We show the variations in the end-to-end latency in one run in Figure 6a. We note the average \(\tilde{L}_{rl}(\mathbb {S}_1)\) and the maximum end-to-end latency \(\overline{L}_{rl}(\mathbb {S}_1)\) as well as the jitters \(J_{rl}(\mathbb {S}_1)\) in Table 3 Row 1. We see that chrl experiences large jitters, i.e., 51.8 ms, which we want to eliminate using latency shaping. These results help us configure the servers in the following experiments.
Table 4: Reserved time slots to run executor threads of the nodes involved in the chain (times are in ms).

Row 1 — 1 server per callback:
  NTh: [0,2] | NrFl: [2,23] | Nfus: [33,37] | Nvg: [37,47] | Nloc: [47,98]

Rows 2–5 — RTh runs NTh; Rm runs NrFl, Nfus, Nvg, Nloc; \(R_{m^\circ }\) runs Nsh:

Row | Config.                | RTh   | Rm     | \(R_{m^\circ }\)
2   | 1 server per chain     | [0,2] | [2,86] | —
3   | 1-band LS              | [0,2] | [2,86] | [86,87]
4   | Random 3-band LS       | [0,2] | [2,86] | [50,51], [70,71], [86,87]
5   | PSO-computed 3-band LS | [0,2] | [2,86] | [62,63], [77,78], [86,87]

5.1.2 Using conventional table-driven reservation or \(\mathbb {S}_2\).

Next, we run chrl following the traditional time-triggered scheduling paradigm. We run each thread \(N_i[\mathcal {T}_{ex}]\), where Ni ∈ {NTh, NrFl, Nfus, Nvg, Nloc}, in a high-priority server comprising one time slot per 100 ms, as given in Table 4 Row 1. Here, the time slots are computed based on the callbacks’ mWCETs (see Table 2). We also run \(N_{fFl}[\mathcal {T}_{ex}]\) using a high-priority server RfFl with a time slot [2, 33] ms. We see that the server running \(N_{fus}[\mathcal {T}_{ex}]\) starts at 33 ms while that running \(N_{rFl}[\mathcal {T}_{ex}]\) ends at 23 ms. This is because cbfus needs to wait for data from cbfFl, and RfFl ends at 33 ms. We note that AVP localization is a DAG (see Figure 4) and not a linear chain; hence, we must consider such synchronization points while configuring servers. Further, the interfering threads \(N_{int,0}[\mathcal {T}_{ex}]\) and \(N_{int,1}[\mathcal {T}_{ex}]\) are put in low-priority servers, Rint, 0 and Rint, 1, each configured with a time slot spanning the entire cycle. We do not change RfFl, Rint, 0, and Rint, 1 for the remaining experiments with AVP.
Figure 6b shows the chain’s end-to-end latency variation using \(\mathbb {S}_2\). We obtain a maximum end-to-end latency of 95.7 ms (see Table 3 Row 2), which is 13% longer than \(\overline{L}_{rl}(\mathbb {S}_1)\). This trend is similar to what we have seen for the synthetic chain in Section 2.5. Further, the jitters reduce from \(J_{rl}(\mathbb {S}_1) = 51.8\:\)ms to 48.6 ms, as in Section 2.5. However, the reduction here is small because the last callback cbloc contributes the most to the chain’s jitters, which cannot be eliminated using traditional time-triggered scheduling.

5.1.3 Using one reservation server (Section 3.1).

Now, we run the threads \(\lbrace N_{rFl}, N_{fus}, N_{vg}, N_{loc}\rbrace [\mathcal {T}_{ex}]\) in one high-priority server Rm, while we still keep \(N_{Th}[\mathcal {T}_{ex}]\) in RTh so that it behaves as a timer callback. The slots in the servers are given in Table 4 Row 2. We configure Rm based on the measured maximum end-to-end latency \(\overline{L}_{rl}(\mathbb {S}_1)\). As expected, the measured maximum end-to-end latency (i.e., 84.4 ms) is similar to \(\overline{L}_{rl}(\mathbb {S}_1) = 84.7\:\)ms. Also, the jitters of 51.6 ms remain almost the same as \(J_{rl}(\mathbb {S}_1)= 51.8\:\)ms. Further, Figure 6c shows the chain’s end-to-end latency variation.

5.1.4 Single-band latency shaping (LS) (Section 3.2.2).

In Autoware’s implementation of NDT Localizer, Pose Estimate is published synchronously. Hence, we apply latency shaping by extending chrl with cbsh in Nsh as shown in Figure 4. We allocate the threads running the extended chain using three reservation servers RTh, Rm, and \(R_{m^\circ }\), where RTh runs \(N_{Th}[\mathcal {T}_{ex}]\), Rm runs \(\lbrace N_{rFl},N_{fus},N_{vg},N_{loc}\rbrace [\mathcal {T}_{ex}]\), and \(R_{m^\circ }\) runs \(N_{sh}[\mathcal {T}_{ex}]\). The time slots in RTh, Rm, and \(R_{m^\circ }\) are given in Table 4 Row 3. In Figure 6d, we see that the chain’s end-to-end latency under this configuration lies in a short band. We have a maximum value of \(\overline{L}_{rl}(\mathbb {S}_2) = 86.7\:\)ms which is slightly longer than \(\overline{L}_{rl}(\mathbb {S}_1) = 84.7\:\)ms. This is because we have (i) added cbsh and (ii) set time slot boundaries in the order of ms. Also, we get jitters \(J_{rl}(\mathbb {S}_2) \approx 0.2\:\)ms, which is significantly lower compared to \(J_{rl}(\mathbb {S}_1) = 51.8\:\)ms.

5.1.5 Multi-band latency shaping (Section 3.3 and 4.1).

Further, we reconfigure \(R_{m^\circ }\) with two more randomly placed time slots, see Table 4 Row 4. Figure 6e shows that this produces three short bands in which the chain’s end-to-end latency lies. Table 3 Row 5 shows that the average end-to-end latency is 76.4 ms, which is 13% longer than \(\tilde{L}_{rl}(\mathbb {S}_1)\), while being 12% shorter than with our single-band latency shaping. The maximum end-to-end latency and the jitters per band change negligibly with respect to our single-band implementation.
Now, we place the additional time slots using PSO, as explained in Section 4.1. For readers acquainted with PSO, we use 100 particles and a maximum of 150 iterations, while we keep the hyperparameters as c1 = 0.5, c2 = 0.3, and w = 0.9. The PSO-computed time slots are provided in Table 4 Row 5 and the obtained latency bands are shown in Figure 6f. In Table 3, we see that the average end-to-end latency reduces to 71.3 ms, i.e., we improve it by 7% relative to a random placement. Also, it is now only 6% longer than \(\tilde{L}_{rl}(\mathbb {S}_1)\). Even if we do not consider multi-mode application design, the jitters become 25 ms using this configuration, which is less than 50% of \(J_{rl}(\mathbb {S}_1)\), while we increase the maximum and the average end-to-end latency values by only 2% and 6%, respectively. However, we can also assign a single, longer time slot [62, 87] ms to \(R_{m^\circ }\), which further reduces the average end-to-end latency without affecting the jitters. This shows that latency shaping allows us to explore the trade-off between the average end-to-end latency and the jitters of a computation chain.
Figure 7: Average end-to-end latency (from PSO) vs number of latency bands.
Figure 8: End-to-end latency variation for the synthetic chain publishing asynchronously.
For chrl, we further study how the average end-to-end latency decreases with an increasing number of bands, as shown in Figure 7. We see that the average end-to-end latency changes negligibly beyond 4 bands.

5.2 Asynchronously-publishing synthetic chain

Figure 9: Controlling jitters in the end-to-end latency of a ROS2 DAG.
For a proof of concept of our idea in Section 3.2.1, we experiment with a simple synthetic chain chsyn with five callbacks, similar to the setup in Section 2.5. However, we implement chsyn[5] to publish on Tsyn[5] asynchronously. Here, we use eProsima’s Fast DDS because it offers such a publish mechanism, and also to demonstrate that our technique can be applied with any DDS implementation. We allocate the threads \(ch_{syn}[\mathbb {T}_{ex}]\) running the callbacks in chsyn as well as the publisher thread \(ch_{syn}[Nd][\mathcal {T}_{pub}]\) on Core 0. On Core 0, we also allocate an interfering timer callback cbint in Nint, which is dispatched every 10 ms and has a maximum execution time of 2 ms. We create three servers on Core 0: a high-priority server Rm that runs \(\mathcal {T} \in ch_{syn}[\mathbb {T}_{ex}]\), a high-priority server \(R_{m^\circ }\) that runs \(ch_{syn}[Nd][\mathcal {T}_{pub}]\), and a low-priority server that runs \(N_{int}[\mathcal {T}_{ex}]\). Rm and \(R_{m^\circ }\) enable latency shaping. The time slots are slm, 1 = [0, 59] ms and \(sl_{m^\circ ,1}=\:[59,60]\:\)ms, while the low-priority server has a single slot spanning the entire period, i.e., [0, 100] ms. Using this configuration, the end-to-end latency \(L_{syn}(\mathbb {S}_2)\) of chsyn lies in a short band as shown in Figure 8, where \(\overline{L}_{syn}(\mathbb {S}_2)= 59.6\:\)ms while \(J_{syn}(\mathbb {S}_2)\approx 0.2\:\)ms. The figure also shows the variation in \(L_{syn}(\mathbb {S}_1)\) of chsyn when it is scheduled using \(\mathbb {S}_1\) and a high priority; here, we measure \(\overline{L}_{syn}(\mathbb {S}_1) = 58.5\:\)ms and \(J_{syn}(\mathbb {S}_1)=33\:\)ms. Again, latency shaping produces negligible jitters with a minimal increase (≈ 2%) in the maximum end-to-end chain latency.

5.3 Comparison with LET

Using servers, we can implement callbacks following LET. (i) When chk[l] publishes asynchronously, we put the associated publisher thread \(ch_k[l][Nd][\mathcal {T}_{pub}]\) in a high-priority server with a time slot placed at the beginning of the period. This ensures that DDS communication takes place at the beginning of the period, and the callback’s response time is then effectively equal to its period. (ii) When chk[l] publishes its output synchronously, we need to insert another callback in the chain that reads data from Tk[l] and republishes it—similar to a latency-shaping callback chk[μk + 1] in Section 3.2.2. We can then run the inserted callback using a server with a time slot placed at the beginning of the period.
Table 5: Comparison with LET — measured maximum end-to-end latency [in ms].
Chain      | 1-band shaping | LET per callback        | LET per chain
Synthetic  | 59.2           | 500                     | 100
AVP        | 86.7           | 400 → simulation fails  | 100
Now, we apply LET to each callback in a synthetic chain with five callbacks and a period of 100 ms—as described in Section 2.5. Here, each callback publishes synchronously. We put a maximum interfering load of 2 ms per 10 ms as described in Section 5.2. As expected, we get a fairly constant end-to-end latency around 500 ms.
When the end-to-end latency of a chain is less than the period of its execution, we can consider the whole chain as one task. Hence, using our latency-shaping concept, we can apply LET to the whole chain by configuring \(R_{m^\circ }\) with a time slot at the beginning of the period. In this case, the chain’s end-to-end latency is constant and equal to one period. Applying this per-chain LET to the aforementioned synthetic chain yields an end-to-end latency of 100 ms.
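For the five-callback synthetic chain, the two LET granularities thus compare as follows (notation introduced here only for this comparison):
\[ L_{\text{LET per callback}} \approx 5 \times 100\:\text{ms} = 500\:\text{ms}, \qquad L_{\text{LET per chain}} = 1 \times 100\:\text{ms} = 100\:\text{ms}. \]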
Further, we apply LET to chrl in AVP and get a fairly constant end-to-end latency around 100 ms, which is 15% longer than what we have obtained using latency shaping. We note that when we apply LET to each callback in chrl, the end-to-end latency becomes too long and the simulation does not run properly.
Table 5 provides the end-to-end latency values obtained using our single-band latency shaping and the LET implementations for both synthetic and AVP chains. It demonstrates that we can flexibly keep the end-to-end latency shorter than a conventional LET implementation if the chained computations always finish within a period.

5.4 Controlling jitters of a chain in a ROS2 DAG

As mentioned in Section 2.3, we do not consider DAG schedule optimization. However, given a partitioned optimal schedule of a DAG (where each callback runs on a specific processor core), we can apply our mechanism to control jitters of a chain in the DAG with a negligible increase in its maximum observed end-to-end latency. For a proof of concept, we first apply a recent schedule optimization technique [43] on an input DAG. It is based on deep reinforcement learning and minimizes the number of cores required to meet the worst-case DAG timing requirement. In the process, it can compute the core assignment for callbacks while also adding more edges in the DAG (or precedence relations between callbacks).
Figure 9a shows a transitive reduction (for a clearer visualization) of such an output DAG. Each of the 18 vertices represents a callback. The vertex border indicates the core assignment, i.e., we need two cores. This DAG executes every 764 ms. The WCET of each callback is shown, and we assume that the execution time varies uniformly between 0.5 × WCET and WCET. Now, we develop a synthetic ROS2 application with 18 nodes, each with one callback. Each callback publishes on a topic. Two outgoing edges from a vertex imply multiple subscribers of the topic. We implement vertex 0 as a timer callback (period 764 ms) while all other vertices are subscriber callbacks. We implement a vertex with two or more incoming edges as a synchronizing (subscriber) callback. We put a computational load on each callback according to the assumed execution time variation for it. We run this application following the core assignment in Figure 9a and using \(\mathbb {S}_1\) with a high priority. We measure the DAG’s end-to-end latency—the time between the start of callback 0 and when callback 1 publishes its output—and its variation is shown in Figure 9b. We observe a maximum value of 450 ms.
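For illustration, the synchronizing subscriber callback mentioned above can be sketched as follows; topic names, the message type, and the combination logic are placeholder assumptions, and a production implementation could equally rely on message_filters [49].

// Sketch of a synchronizing (subscriber) callback for a DAG vertex with two
// incoming edges: the computation fires only after both predecessors have
// published in the current iteration. Topic names and type are placeholders.
#include <optional>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/float64.hpp>

class SyncVertex : public rclcpp::Node {
public:
  SyncVertex() : Node("sync_vertex") {
    pub_ = create_publisher<std_msgs::msg::Float64>("/vertex_out", 10);
    sub_a_ = create_subscription<std_msgs::msg::Float64>(
        "/pred_a_out", 10,
        [this](std_msgs::msg::Float64::SharedPtr m) { a_ = *m; try_fire(); });
    sub_b_ = create_subscription<std_msgs::msg::Float64>(
        "/pred_b_out", 10,
        [this](std_msgs::msg::Float64::SharedPtr m) { b_ = *m; try_fire(); });
  }

private:
  void try_fire() {
    if (!a_ || !b_) return;          // wait until both inputs are present
    std_msgs::msg::Float64 out;
    out.data = a_->data + b_->data;  // placeholder for the vertex's computation
    pub_->publish(out);
    a_.reset();                      // consume the inputs for this iteration
    b_.reset();
  }
  std::optional<std_msgs::msg::Float64> a_, b_;
  rclcpp::Publisher<std_msgs::msg::Float64>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::Float64>::SharedPtr sub_a_, sub_b_;
};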
Now, we configure two high-priority servers Rm0 and Rm1 to run on Core 0 and Core 1, respectively, while each comprises a slot [0, 450] ms. The threads running the callbacks on Core 0 (Core 1) are assigned to Rm0 (Rm1). The output topic name of callback 1 is remapped from T[1][Nm] to \(T[1][Nm]\_tmp\). We add a node Nsh with a callback cbsh whose subscribed topic name is \(T[1][Nm]\_tmp\) and whose published topic name is T[1][Nm]. We put another high-priority server \(R_{m^\circ }\) on Core 1 with a time slot [450, 451) ms. We map the thread \(N_{sh}[\mathcal {T}_{ex}]\) to \(R_{m^\circ }\). Further, we run the latency-shaped DAG application and record an almost constant end-to-end latency, see Figure 9b. We note that its analytical worst-case end-to-end latency is 560 ms (based on the WCETs), while our measurement-based approach produces ≈ 450 ms.
This DAG has one sink vertex. For a DAG with multiple sinks, we can add a latency-shaping node per sink vertex and schedule it using an appropriate server. That is, we can perform latency shaping for multiple chains in a DAG. However, we need a mechanism to first co-optimize their theoretical worst-case end-to-end latency values.

6 Future Considerations

In our problem setting as explained in Section 2.3, a critical chain may experience interference from a non-critical callback run by an executor thread that also runs a callback in the chain. For example, in Section 5.1, the Throttler node NTh has two callbacks, cbrTh and cbfTh, while only cbrTh is in the critical chain chrl. Both are run by \(N_{Th}[\mathcal {T}_{ex}]\) in RTh, and RTh[Sl] is dimensioned to sufficiently accommodate both callbacks every cycle. For more precise control over each callback’s execution in a node, we suggest organizing its callbacks into multiple mutually exclusive callback groups and assigning each group to a different executor using the add_callback_group(…) function [36].
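A minimal sketch of this suggestion is given below, assuming one time-critical and one best-effort subscription in the same node (topic names and message types are illustrative): each callback is placed in its own mutually exclusive callback group, and each group is handed to a separate executor via add_callback_group(…), so that the two executor threads can be mapped to different reservation servers.

// Sketch: two mutually exclusive callback groups in one node, each served by a
// different executor, so that their threads can be placed in different
// reservation servers. Topic names and message types are illustrative.
#include <memory>
#include <thread>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv) {
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("throttler_like_node");

  // Do not add the groups automatically when the node is added to an executor.
  auto critical_group =
      node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive, false);
  auto best_effort_group =
      node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive, false);

  rclcpp::SubscriptionOptions critical_opts;
  critical_opts.callback_group = critical_group;
  rclcpp::SubscriptionOptions best_effort_opts;
  best_effort_opts.callback_group = best_effort_group;

  auto sub_critical = node->create_subscription<std_msgs::msg::String>(
      "/critical_in", 10,
      [](std_msgs::msg::String::SharedPtr) { /* callback in the critical chain */ },
      critical_opts);
  auto sub_best_effort = node->create_subscription<std_msgs::msg::String>(
      "/best_effort_in", 10,
      [](std_msgs::msg::String::SharedPtr) { /* non-critical callback */ },
      best_effort_opts);

  // Each executor thread can now be assigned to its own reservation server.
  rclcpp::executors::SingleThreadedExecutor critical_exec, best_effort_exec;
  critical_exec.add_callback_group(critical_group, node->get_node_base_interface());
  best_effort_exec.add_callback_group(best_effort_group, node->get_node_base_interface());

  std::thread t1([&] { critical_exec.spin(); });
  std::thread t2([&] { best_effort_exec.spin(); });
  t1.join();
  t2.join();
  rclcpp::shutdown();
  return 0;
}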
Further, this paper does not study the case when the chain needs to run longer than the slot duration in Rm, which would delay the next chain execution. We can adapt our design to eliminate such delays when the chain’s worst-case end-to-end latency is shorter than its period. That is, we can extend slm, 1 to cover the entire period so that the chain finishes its computation within a period. However, the chain’s output is not published in the same period if the corresponding \(sl_{m^\circ ,1}\) is missed. We can add an appropriate lifespan QoS for Tk[μk] so that such a late output is not processed in the next period. In this case, we discard a chain execution if it overruns the estimated maximum end-to-end latency. Several works [14, 24, 47, 48] have analyzed system safety—in terms of control stability and performance—when control updates may be skipped. In Section 3.1, we also commented on the case where the worst-case end-to-end latency is longer than the period.
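As an illustration of such a QoS setting, the sketch below attaches a lifespan of one period (here 100 ms, an assumed value) to the chain’s output topic so that a sample which misses its slot expires instead of being delivered in the next period; whether this policy alone suffices depends on the DDS implementation in use.

// Sketch: attaching a lifespan QoS of one period (illustrative 100 ms) to the
// chain's output topic so that a sample missing its slot expires and is not
// processed in the next period. Topic name and type are placeholders.
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

rclcpp::Publisher<std_msgs::msg::String>::SharedPtr
make_output_publisher(rclcpp::Node & node) {
  rclcpp::QoS qos(rclcpp::KeepLast(1));
  qos.lifespan(rclcpp::Duration::from_seconds(0.1));  // one period
  return node.create_publisher<std_msgs::msg::String>("/chain_output_tmp", qos);
}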
While we have demonstrated how latency shaping can be combined with DAG scheduling, a better formalization of the application model and the solution is left as future work. Also, when an executor thread runs multiple callbacks of a DAG and is statically bound to a core, partitioned DAG scheduling must account for the fact that these callbacks run on one processing unit, similar to [42].
While we can support multi-mode application design, we have not formalized the logic inside a latency-shaping node chk[μk + 1][Nd] to ensure a constant end-to-end latency in each mode. First, chk[μk + 1] must be aware of the active mode; hence, another callback in chk[μk + 1][Nd] subscribes to the corresponding topic and passes the information via a shared variable. Further, chk[μk + 1] must track in which slot it is running, which is possible using a timer callback in chk[μk + 1][Nd] that updates a counter in each slot of \(R_{m^\circ }\) while sharing the value with chk[μk + 1]. Being aware of the active mode and slot, chk[μk + 1] can decide appropriately to publish the output, discard it, or wait for the next slot. Here, we do not dynamically update \(R_{m^\circ }\).
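A possible skeleton of such a mode- and slot-aware latency-shaping node is sketched below; the topic names, the mode encoding, the number of slots, and the policy in try_publish() are all illustrative assumptions rather than a formalization.

// Skeleton of a mode- and slot-aware latency-shaping node. Topic names, the
// mode encoding, and the slot bookkeeping below are illustrative assumptions.
#include <optional>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/int32.hpp>
#include <std_msgs/msg/string.hpp>

class MultiModeShaper : public rclcpp::Node {
public:
  MultiModeShaper() : Node("multi_mode_shaper") {
    pub_ = create_publisher<std_msgs::msg::String>("/chain_output", 10);
    data_sub_ = create_subscription<std_msgs::msg::String>(
        "/chain_output_tmp", 10,
        [this](std_msgs::msg::String::SharedPtr m) { pending_ = *m; try_publish(); });
    mode_sub_ = create_subscription<std_msgs::msg::Int32>(
        "/active_mode", 10,
        [this](std_msgs::msg::Int32::SharedPtr m) { mode_ = m->data; });
    // Timer whose activations are assumed to be released once per slot of R_m°;
    // it advances the slot counter and flushes data deferred to a later slot.
    slot_timer_ = create_wall_timer(std::chrono::milliseconds(10), [this]() {
      slot_ = (slot_ + 1) % kSlotsPerPeriod;
      try_publish();
    });
  }

private:
  void try_publish() {
    if (!pending_) return;
    // Mode-dependent policy (illustrative): mode 0 publishes in any slot,
    // mode 1 only in the last slot of the period; otherwise keep waiting.
    const bool allowed = (mode_ == 0) || (mode_ == 1 && slot_ == kSlotsPerPeriod - 1);
    if (allowed) {
      pub_->publish(*pending_);
      pending_.reset();
    }
  }
  static constexpr int kSlotsPerPeriod = 3;
  int mode_ = 0;
  int slot_ = 0;
  std::optional<std_msgs::msg::String> pending_;
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr data_sub_;
  rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr mode_sub_;
  rclcpp::TimerBase::SharedPtr slot_timer_;
};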

7 Related Work

7.1 Methods to control timing variations

Several works in control theory have studied the impact of software and network timings on control performance. JITTERBUG [31] and JITTERTIME [14] estimate control performance in non-ideal timing conditions, e.g., jitters, execution overruns, dropped samples, and random sampling. It is widely accepted that the performance of a control application degrades with increasing sensing-to-actuation delay jitters, see [23] for an example. Hence, in the real-time systems community, mechanisms have been proposed to reduce jitters.
A popular choice for this purpose is to implement the logical execution time (LET) concept [25], which states that a task reads its inputs at the beginning of the period and produces its output at the end of the period, i.e., the response time of a task is exactly equal to one period. Towards an LET implementation, Biondi et al. in [7] schedule high-priority tasks at the beginning of the period to copy data (i) from the producer’s local memory to the global memory and (ii) from the global memory to the consumer’s local memory. Similarly, Pazzaglia et al. in [33, 34] use DMA to accelerate data transfers between tasks in different processing cores following the LET paradigm. Further, in [18, 21, 22], a system-level LET concept is introduced for distributed applications in which an interconnect task exchanges data between two processors in a non-negligible yet constant time. However, existing LET implementations cannot be applied trivially to ROS2 chains, as discussed in Section 1. Also, our proposed two-server scheduling mechanism allows more flexible control over the end-to-end timings of such chains; in particular, it can produce multiple short latency bands.
Further, Buttazzo and Cervin have proposed three more jitter-control mechanisms [12]: task splitting, advancing deadlines, and non-preemption. The latter two mechanisms are not effective when the tasks have large variations in their execution times. The first method splits a task into several parts and schedules the first sub-task at the beginning of the period and the last sub-task at the end of the period, which is similar to an LET implementation but requires modifying the task implementation. In contrast, our mechanism does not require modifying or recompiling the application source code.

7.2 Timing analysis and optimization for ROS2

There have been efforts to study the real-time behavior of ROS2 applications in recent years. Casini et al. in [13] have outlined how ROS2 executors schedule callbacks in a node. Also, it has been pointed out that applications contain chains of callbacks and a compositional performance analysis (CPA) technique has been employed to bound the chains’ end-to-end latency when each ROS2 executor uses a constant-bandwidth server (CBS) to run callbacks. Later, Tang et al. have improved the precision of the worst-case end-to-end latency analysis [44] for the same system model. Blaß et al. in [8] have improved the results by further (i) reducing the pessimism in the callback activation model, (ii) considering a cumulative processor demand function for each callback, and (iii) making the timing analysis aware of the starvation freedom offered by ROS2 executors.
Previous works have also studied different scheduling policies in a ROS2 environment. To meet end-to-end timing requirements, heterogeneous laxity-based DAG scheduling coupled with static priority assignment for ROS2 nodes has been studied [40]. Yang and Azumi have evaluated a callback-group-level ROS2 executor in [50] where a callback is assigned to an executor based on its real-time requirement (i.e., time-critical or best-effort). PiCAS [15] minimizes the chains’ end-to-end latency by statically prioritizing the execution of their callbacks. Also, it (i) assigns priorities to callbacks, (ii) allocates nodes to executors, (iii) maps executors to processing cores, and (iv) analyzes the worst-case end-to-end latency. While the above works have considered static scheduling, Blaß et al. in [9] have dynamically refined the CBS budgets for ROS2 executors based on online timing measurements. Al Arafat et al. have proposed deadline-driven dynamic priority assignment for ROS2 chains as well as a worst-case end-to-end latency analysis technique [3]. Most of these works have mentioned that ROS2 chains often implement sense-compute-actuate control logic. However, all of them have focused only on the worst-case end-to-end latency, and none of them has considered minimizing the jitters, which is the main focus of our work.

8 Concluding Remarks

This paper introduces the concept of latency shaping, which primarily enables a low-jitter implementation of a ROS2 computation chain. Compared to LET, it reduces pessimism and supports a multi-mode implementation of a chain. It is also more practical because it uses the profiling results of the chain instead of relying on analytical frameworks. We further show how the concept can be implemented considering different mechanisms to publish data in ROS2. An important aspect of our proposed idea is that it does not require modifying or recompiling the application code, which is a crucial requirement to preserve the separation of concerns between application development and timing engineering in industry.
While, in this work, we have considered a static allocation of threads to reservation servers, in the future, we intend to explore a dynamic allocation for the last callback in the chain so that we can split the computation and data-publishing tasks of its thread across two servers, thereby eliminating the additional latency-shaping callback. It would also be interesting to explore the co-design of control logic and latency shaping. While we have considered computation chains over ROS2, DDS, and Linux, we believe that the idea is generic and can be applied to other middlewares and operating systems as well.

References

[1]
H. Abaza, D. Roy, S. Fan, S. Saidi, and A. Motakis. 2024. Trace-enabled Timing Model Synthesis for ROS2-based Autonomous Applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE). https://arxiv.org/abs/2311.13333
[2]
Benny Akesson, Mitra Nasri, Geoffrey Nelissen, Sebastian Altmeyer, and Robert I. Davis. 2022. A Comprehensive Survey of Industry Practice in Real-Time Systems. Real-Time Syst. 58, 3 (2022), 358–398.
[3]
Abdullah Al Arafat, Sudharsan Vaidhun, Kurt M. Wilson, Jinghao Sun, and Zhishan Guo. 2022. Response Time Analysis for Dynamic Priority Scheduling in ROS2. In ACM/IEEE Design Automation Conference (DAC).
[6]
Christophe Bédard, Ingo Lütkebohle, and Michel Dagenais. 2022. ros2_tracing: Multipurpose Low-Overhead Framework for Real-Time Tracing of ROS 2. IEEE Robotics and Automation Letters 7, 3 (2022), 6511–6518.
[7]
Alessandro Biondi and Marco Di Natale. 2018. Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[8]
Tobias Blaß, Daniel Casini, Sergey Bozhko, and Björn B. Brandenburg. 2021. A ROS 2 Response-Time Analysis Exploiting Starvation Freedom and Execution-Time Variance. In IEEE Real-Time Systems Symposium (RTSS).
[9]
Tobias Blaß, Arne Hamann, Ralph Lange, Dirk Ziegenbein, and Björn B. Brandenburg. 2021. Automatic Latency Management for ROS 2: Benefits, Challenges, and Open Problems. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[10]
B. B. Brandenburg, M. Gül, and M. Vanga. 2017. A tour of LITMUS\(^\text{RT}\). https://www.litmus-rt.org/tutorial/manual.html
[11]
Nick Burek. 2019. ROS QoS - Deadline, Liveliness, and Lifespan. https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html
[12]
G. C. Buttazzo and A. Cervin. 2007. Comparative Assessment and Evaluation of Jitter Control Methods. In International Conference on Real-Time and Network Systems (RTNS).
[13]
Daniel Casini, Tobias Blaß, Ingo Lütkebohle, and Björn B. Brandenburg. 2019. Response-Time Analysis of ROS 2 Processing Chains under Reservation-Based Scheduling. In Euromicro Conference on Real-Time Systems (ECRTS).
[14]
Anton Cervin, Paolo Pazzaglia, Mohammadreza Barzegaran, and Rouhollah Mahfouzi. 2019. Using JitterTime to Analyze Transient Performance in Adaptive and Reconfigurable Control Systems. In IEEE International Conference on Emerging Technologies and Factory Automation (ETFA).
[15]
Hyunjong Choi, Yecheng Xiang, and Hyoseung Kim. 2021. PiCAS: New design of priority-driven chain-aware scheduling for ROS2. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[18]
Rolf Ernst, Leonie Ahrendts, and Kai-Björn Gemlau. 2018. System Level LET: Mastering Cause-Effect Chains in Distributed Systems. In Conference of the IEEE Industrial Electronics Society (IECON).
[19]
Eclipse Foundation. 2022. A guide to the configuration options of Eclipse Cyclone DDS. https://cyclonedds.io/docs/cyclonedds/0.9.1/config.html
[20]
[21]
Kai-Björn Gemlau, Leonie Köhler, Rolf Ernst, and Sophie Quinton. 2021. System-Level Logical Execution Time: Augmenting the Logical Execution Time Paradigm for Distributed Real-Time Automotive Software. ACM Trans. Cyber-Phys. Syst. 5, 2, Article 14 (2021), 27 pages.
[22]
Kai-Björn Gemlau, Leonie Köhler, and Rolf Ernst. 2021. Efficient Run-Time Environments for System-Level LET Programming. In Design, Automation & Test in Europe Conference & Exhibition (DATE).
[23]
Sumana Ghosh, Arnab Mondal, Debayan Roy, Philipp H. Kindt, Soumyajit Dey, and Samarjit Chakraborty. 2021. Proactive feedback for networked CPS. In Annual ACM Symposium on Applied Computing (SAC).
[24]
Dip Goswami, Reinhard Schneider, and Samarjit Chakraborty. 2014. Relaxing Signal Delay Constraints in Distributed Embedded Controllers. IEEE Transactions on Control Systems Technology 22, 6 (2014), 2337–2345.
[25]
T.A. Henzinger, B. Horowitz, and C.M. Kirsch. 2003. Giotto: a time-triggered language for embedded programming. Proc. IEEE 91, 1 (2003), 84–99.
[26]
J. Kennedy and R. Eberhart. 1995. Particle swarm optimization. In International Conference on Neural Networks (ICNN).
[28]
Tobias Kronauer, Joshwa Pohlmann, Maximilian Matthé, Till Smejkal, and Gerhard Fettweis. 2021. Latency Analysis of ROS2 Multi-Node Systems. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI).
[29]
Takahisa Kuboichi, Atsushi Hasegawa, Bo Peng, Keita Miura, Kenji Funaoka, Shinpei Kato, and Takuya Azumi. 2022. CARET: Chain-Aware ROS 2 Evaluation Tool. In IEEE International Conference on Embedded and Ubiquitous Computing (EUC).
[30]
Zihang Li, Atsushi Hasegawa, and Takuya Azumi. 2022. Autoware_Perf: A Tracing and Performance Analysis Framework for ROS 2 Applications. J. Syst. Archit. 123, C (2022), 10 pages.
[31]
B. Lincoln and A. Cervin. 2002. JITTERBUG: A tool for analysis of real-time control performance. In IEEE Conference on Decision and Control (CDC).
[32]
Lester James V. Miranda. 2017. PYSWARMS: A research toolkit for particle swarm optimization in Python. https://pyswarms.readthedocs.io/en/latest/
[33]
Paolo Pazzaglia, Daniel Casini, Alessandro Biondi, and Marco Di Natale. 2021. Optimal Memory Allocation and Scheduling for DMA Data Transfers under the LET Paradigm. In ACM/IEEE Design Automation Conference (DAC).
[34]
Paolo Pazzaglia, Daniel Casini, Alessandro Biondi, and Marco Di Natale. 2023. Optimizing Inter-Core Communications Under the LET Paradigm using DMA Engines. IEEE Trans. Comput. 72, 1 (2023), 127–139.
[35]
[37]
Debayan Roy, Wanli Chang, Sanjoy K. Mitter, and Samarjit Chakraborty. 2019. Tighter Dimensioning of Heterogeneous Multi-Resource Autonomous CPS with Control Performance Guarantees. In Design Automation Conference (DAC).
[38]
Debayan Roy, Sumana Ghosh, Qi Zhu, Marco Caccamo, and Samarjit Chakraborty. 2020. GoodSpread: Criticality-Aware Static Scheduling of CPS with Multi-QoS Resources. In IEEE Real-Time Systems Symposium (RTSS).
[39]
Debayan Roy, Licong Zhang, Wanli Chang, Sanjoy K. Mitter, and Samarjit Chakraborty. 2018. Semantics-Preserving Cosynthesis of Cyber-Physical Systems. Proc. IEEE 106 (2018), 171–200.
[40]
Yukihiro Saito, Futoshi Sato, Takuya Azumi, Shinpei Kato, and Nobuhiko Nishio. 2018. ROSCH: Real-time scheduling framework for ROS. In IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA).
[41]
Gerlando Sciangula, Daniel Casini, Alessandro Biondi, Claudio Scordino, and Marco Di Natale. 2023. Bounding the data-delivery latency of DDS messages in real-time applications. In Euromicro Conference on Real-Time Systems (ECRTS)(Leibniz International Proceedings in Informatics (LIPIcs)).
[42]
J. Shi, M. Günzel, N. Ueter, G. von der Brüggen, and J.-J. Chen. 2024. DAG Scheduling with Execution Groups. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[43]
Binqi Sun, Mirco Theile, Ziyuan Qin, Daniele Bernardini, Debayan Roy, Andrea Bastoni, and Marco Caccamo. 2024. Edge Generation Scheduling for DAG Tasks using Deep Reinforcement Learning. IEEE Trans. Comput. (2024).
[44]
Yue Tang, Zhiwei Feng, Nan Guan, Xu Jiang, Mingsong Lv, Qingxu Deng, and Wang Yi. 2020. Response Time Analysis and Priority Assignment of Processing Chains on ROS2 Executors. In IEEE Real-Time Systems Symposium (RTSS).
[45]
Harun Teper, Mario Günzel, Niklas Ueter, Georg von der Brüggen, and Jian-Jia Chen. 2022. End-To-End Timing Analysis in ROS2. In IEEE Real-Time Systems Symposium (RTSS).
[46]
Manohar Vanga, Andrea Bastoni, Henrik Theiling, and Björn B. Brandenburg. 2017. Supporting Low-Latency, Low-Criticality Tasks in a Certified Mixed-Criticality OS. In International Conference on Real-Time Networks and Systems (RTNS).
[47]
Nils Vreman and Martina Maggio. 2023. Stochastic Analysis of Control Systems Subject to Communication and Computation Faults. ACM Trans. Embed. Comput. Syst. 22, 5s, Article 144 (sep 2023), 25 pages.
[48]
Nils Vreman, Paolo Pazzaglia, Victor Magron, Jie Wang, and Martina Maggio. 2022. Stability of Linear Systems Under Extended Weakly-Hard Constraints. IEEE Control Systems Letters 6 (2022), 2900–2905.
[49]
Lucas Wendland. 2024. ros2/message_filters. https://github.com/ros2/message_filters/tree/rolling
[50]
Yuqing Yang and Takuya Azumi. 2020. Exploring real-time executor on ROS 2. In IEEE International Conference on Embedded Software and Systems (ICESS).
