Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design

Yiyang Huang (ICT, Chinese Academy of Sciences, China), Yuhui Hao (ICT, Chinese Academy of Sciences, China), Bo Yu (Shenzhen Institute of Artificial Intelligence and Robotics for Society, China), Feng Yan (Meituan, China), Yuxin Yang (ICT, Chinese Academy of Sciences, China), Feng Min (ICT, Chinese Academy of Sciences, China), Yinhe Han (ICT, Chinese Academy of Sciences, China), Lin Ma (Meituan, China), Shaoshan Liu (Shenzhen Institute of Artificial Intelligence and Robotics for Society, China), Qiang Liu (Tianjin University, China), and Yiming Gan (ICT, Chinese Academy of Sciences, China)
Abstract.

Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely around the interests of algorithm developers, with robot actions divided on a discrete, per-frame basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control and data communication in the embodied AI robot compute pipeline. Instead of predicting the action for one single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with hardware that accelerates transforming the trajectory into the actual torque signals used to control robots, and with an execution pipeline that overlaps data communication with computation. Corki reduces LLM inference frequency by up to $8.0\times$, resulting in up to a $3.6\times$ speedup. The success rate improvement can be up to 17.3%. Code is provided for re-implementation: https://github.com/hyy0613/Corki

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and long-term task planning (RT1, ; RT2, ; driess2023palm, ; huang2023voxposer, ; duan2022survey, ; song2023llm, ; liu2024robouniviewvisuallanguagemodelunified, ). Building upon the success of LLMs, the field of embodied AI, which employs LLMs to control robots interacting with the physical world, is increasingly recognized as a promising step towards achieving Artificial General Intelligence (AGI).

The single most important difference between using LLMs for generating text and images versus integrating them as decision-making and planning modules within robotic pipelines lies in the hard real-time constraints imposed on robots (mei2006deployment, ; khatib1986real, ). Without real-time assurances, the applicability of embodied AI systems is severely limited to theoretical studies rather than real-world applications.

However, current embodied AI systems struggle to meet real-time constraints. The fundamental reason lies in the execution model of embodied AI systems. To date, all embodied AI systems follow a sequential execution model that processes video input and generates robot actions on a frame-by-frame basis. Specifically, after warming up, the robot starts with a video sequence containing $N$ frames and a language instruction $i$. The LLM predicts the robot action tuple $(\Delta x, \Delta y, \Delta z, \ldots)$ based on the current input tuple $(Frame_{t-N}, Frame_{t-N+1}, \ldots, Frame_{t}, i)$, where $\Delta$ denotes the proposed robot movements and $Frame_{t}$ represents images within the video sequence. Upon executing the action, the robot captures a subsequent frame $Frame_{t+1}$ at the latest position. The next LLM inference then processes the updated input tuple $(Frame_{t-N+1}, Frame_{t-N+2}, \ldots, Frame_{t+1}, i)$.

The current execution model is time-consuming due to two primary reasons. First, the sequential nature of each stage significantly contributes to the overall latency. Since most robots depend on high-end servers for LLM inference, the latency associated with the embodied AI systems is the cumulative effect of three distinct stages: LLM inference latency, robot action execution latency, and data communication latency. The sum of these latencies for each frame can add up to several hundred milliseconds. Second, all three stages have to happen for every frame, which further hurts the real-time constraints. We show this pipeline in Fig. 1a.

(a) Current discrete execution pipeline of embodied AI systems, where every time a single next step action is predicted and all three stages happen for every frame.
(b) Proposed continuous execution pipeline of embodied AI systems, where model predicts near future trajectory and pipelines communication latency with robot execution latency.
Fig. 1. Existing embodied AI systems pipeline and Corki pipeline.

Idea.

Today’s embodied AI pipeline is designed purely based on the convenience of algorithm designers as executing frame by frame sequentially is a traditional method in video processing algorithms, yet it violates a basic principle of robotic software design. That is, the front-end, responsible for perception and planning, does not inherently require real-time performance, whereas the back-end, which includes robot control algorithms, must operate in real-time. Critically, the unbalanced frequency requirements existing in robotic software stack allow us to decouple LLM inference, robotic control and data communication. After decoupling, we are able to reduce the front-end LLM inference rate, pipelining three stages and accelerating robotic control algorithms to achieve real-time performance in embodied AI applications.

Design.

In this paper, we fundamentally change the execution pipeline of existing embodied AI systems to reduce the end-to-end latency. Firstly, at algorithm level, we depart from the conventional approach of predicting robot movement in the next frame discretely. Instead, we propose a novel embodied AI algorithm framework that is able to predict the trajectory of the robot for the near future. Unlike methods that focus on predicting only the immediate subsequent step, our algorithm accurately forecasts actions for multiple future steps. Thus, we significantly reduce the inference frequency of LLMs, saving both latency and energy.

Secondly, to accelerate the control process, we devise an accelerator capable of translating the trajectories predicted by LLMs into seamless and real-time control signals. The accelerator we design is tailored for task space computed torque control, with a customized data-flow accelerator, customized circuits and dedicated on-chip buffer design.

Finally, at the system level, we streamline the transmission of newly captured frames to the server concurrently with the robot execution process. This approach effectively hides communication latency beneath the robot execution latency, resulting in a further reduction of the end-to-end latency. We illustrate our idea with Fig. 1b.

Results.

We use a state-of-the-art embodied AI system, RoboFlamingo (li2023vision, ), as our baseline, and apply Corki on top of it. We show that Corki achieves up to a $3.6\times$ speedup and improves the success rate over the baseline by up to 17.3%. The contributions of this paper are summarized as follows:

  • We observe that the existing embodied AI pipeline cannot satisfy real-time constraints because of the sequential execution pipeline and the balanced frequency of front-end video capture and back-end robot control. The high frequency of LLM inference and the latency accumulation of every stage result in high latency.

  • We propose a new embodied AI algorithm framework to control robots by predicting future trajectories instead of the discrete movement of every frame.

  • We design an accelerator to smoothly transform the trajectory predicted by our models into robotic control signals in real time.

  • We design a new execution pipeline based on our proposed framework and hardware accelerator to hide communication latency between robot body and server.

  • We demonstrate Corki with an efficient implementation of the proposed embodied AI system. We show that Corki is able to significantly reduce the end-to-end latency without sacrificing accuracy.

We organize our paper as follows. Sec. 2 introduces basic embodied AI system pipeline and motivates our paper. Sec. 3 introduces a new embodied AI algorithm framework that is able to predict continuous near-future trajectory of robots. Sec. 4 describes the proposed hardware accelerator for controlling robots given predicted trajectory and system pipeline design. Sec. 5 discusses the experimental methodology, followed by the evaluation results in Sec. 6. We discuss the related work in Sec. 7 and conclude our paper in Sec. 8.

2. Background and Motivation

We introduce the background of embodied AI systems (Sec. 2.1). We show that the execution pipeline of embodied AI systems is significantly different from previous utilization of LLMs and results in high end-to-end frame latency (Sec. 2.2).

2.1. Embodied AI System

Traditional robots typically depend on rule-based algorithms for decision-making and task planning, confining their utility to simple and predetermined scenarios. In contrast, the success of Large Language Models (LLMs) has spurred efforts to equip robots with advanced capabilities in reasoning and long-term planning. Such success boosts the emergence of applications that use LLMs for robot control, which has demonstrated notable advancements, particularly in enhancing the success rates of robots performing complex tasks in dynamic scenarios (li2023vision, ; wake2023gpt, ; hu2023toward, ; firoozi2023foundation, ).

Embodied AI systems represent a category of systems that leverage the reasoning abilities of Large Language Models (LLMs) to guide robots in accomplishing complex real-world tasks, including but not limited to housekeeping and industrial manufacturing, with the goal of reducing human efforts. Typically, these systems comprise two integral components: a high-end server equipped with GPUs for LLM inference and a robot body responsible for executing and interacting with the physical environment.

Embodied AI systems commonly employ a multi-modality Large Language Model (lyu2023macaw, ; zhao2023bubogpt, ; mu2024embodiedgpt, ; huang2023chatgpt, ) as the planning module. This LLM seamlessly integrates language instruction inputs, such as "put the blue mug on the table and bring me the red one," with traditional sensor inputs in the robotic pipeline, including continuous videos, IMU signals, and point clouds (everett1995sensors, ; li2019common, ; santaera2015low, ). The LLM inference will generate the next actions for the robot body to perform based on current and recent observations along with the instructions.

Recently, embodied AI systems have demonstrated substantial potential to replace humans in various tasks. Google's robotic transformer (RT1, ; RT2, ) has achieved an impressive success rate of over 75.0% on tasks including "pick up objects", "open drawers", and "place objects into designated places" within domestic environments. RoboFlamingo (li2023vision, ), a recently proposed embodied AI framework, further elevates the success rate of a single task to over 89.5%.

2.2. Execution Pipeline and Performance Bottleneck

We use RoboFlamingo as an example of embodied AI systems to illustrate existing system pipelines. RoboFlamingo utilizes a vision-language model (VLM) to control a Franka Emika Panda robot arm with a parallel gripper (gaz2019dynamic, ), which in total has seven degrees of freedom. RoboFlamingo takes two forms of input: a language instruction and a video sequence containing 12 images. The model predicts the action of the robot arm's end-effector for the next step. Equ. 1 describes the LLM inference process at frame $t$, where $F_{t}$ represents a single frame within a video sequence and $i$ denotes the language instruction. $\Delta x, \Delta y, \Delta z$ are the three-dimensional position change, $\Delta\alpha, \Delta\beta, \Delta\gamma$ are the three-dimensional rotation change, and $g$ is the one-dimensional gripper status, which can be either open or closed.

(1) $(\Delta x, \Delta y, \Delta z, \Delta\alpha, \Delta\beta, \Delta\gamma, g) = LLM(F_{t-11}, F_{t-10}, \ldots, F_{t}, i)$

After the model predicts the action, the robot arm will perform the action, moving itself to a new position. The control process on the robot translates movement information of the end-effector to the actual torque of each motor placed on the joints of the robot arm. A camera on the robot, usually placed on the gripper, will capture a new frame $F_{t+1}$ and send it back to the model to update the input frames. The next inference will happen on $(F_{t-10}, F_{t-9}, \ldots, F_{t+1}, i)$.

(a) Per-frame latency breakdown.
(b) Per-frame energy breakdown.
Fig. 2. Latency and energy breakdown of RoboFlamingo framework.

Specifically, we analyze and characterize the execution pipeline of RoboFlamingo by breaking down the execution latency and show the results in Fig. 2. To get the results, we run LLM inference on an NVIDIA V100 GPU and robot control on an Intel 13th-generation i7-13700 CPU, and measure data communication over a Wi-Fi module.

Fig. 2a shows that even with a relatively small LLM (3 billion parameters) and a high-end GPU, the end-to-end frame latency of the embodied AI system can reach up to 249.4 ms, which directly leads to a very low frame rate that does not satisfy real-time constraints. Among all three stages, LLM inference takes 76.9% of the execution time, robot control takes 4.1% and data communication takes 19.0%. Fig. 2b shows the energy breakdown. LLM inference still dominates with 98.0% of the total energy, while robot control and data communication together account for only 2.0%. Notice that the latency spent on control is low in the baseline system since the control frequency is set to match the front-end frame rate of 30 Hz. However, in real robotic systems, control usually runs at a much higher rate. Our characterization suggests that for each frame, to get a smooth trajectory, the corresponding control latency can add up to 13.9% of the total latency.
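To make the breakdown concrete, the short sketch below converts the percentages quoted above into per-stage latencies and the resulting frame rate. It only re-does the arithmetic on the figures stated in the text; the function and variable names are illustrative and not part of the Corki code base.

```python
# Per-frame figures quoted in the text for the RoboFlamingo baseline.
TOTAL_MS = 249.4
SHARE = {
    "llm_inference": 0.769,
    "robot_control": 0.041,
    "data_communication": 0.190,
}


def stage_latency_ms(total_ms, share):
    """Split the end-to-end frame latency into per-stage latencies (ms)."""
    return {stage: total_ms * frac for stage, frac in share.items()}


def frame_rate_hz(total_ms):
    """Frame rate when every frame pays the full sequential latency."""
    return 1000.0 / total_ms


stages = stage_latency_ms(TOTAL_MS, SHARE)
rate = frame_rate_hz(TOTAL_MS)

for name, ms in stages.items():
    print(f"{name}: {ms:.1f} ms")
print(f"frame rate: {rate:.2f} Hz")
```

Under these numbers, LLM inference alone costs roughly 192 ms per frame and the sequential pipeline runs at about 4 frames per second, far below the 30 Hz front-end frame rate mentioned above.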

Bottleneck Analysis.

Detailed characterization data suggests the reasons contributing to slow execution of embodied AI robots are mainly twofold. First, the frame-by-frame sequential execution pipeline forces every action of the robot to undergo three stages: LLM inference, robot control and communication, and the latencies accumulate. Second, LLM inference, even with high-end GPU acceleration, is still extremely slow. The above reasons motivate this work.

From the perspective of a robotic system designer, the planning and decision making module does not need to match the high frequency of the control module. Trajectory is usually used as a bridge to eliminate the frequency mismatch. We use the same principle here.

3. Corki Algorithm Framework

We introduce the Corki algorithm in this section. The key insight of our algorithmic innovation is to change per-frame robot action prediction (Sec. 3.1) into robot trajectory prediction (Sec. 3.2). We further optimize the algorithm framework with adaptive trajectory length selection (Sec. 3.3), which also provides a trade-off between accuracy and performance.

3.1. Baseline Embodied AI Algorithms

RoboFlamingo comprises two main components: a vision language model (VLM) (alayrac2022flamingo, ) and an LSTM network (hochreiter1997long, ) named the policy head. At every time step $t$, the VLM takes visual observations $F_{t}$ and a language instruction $i$ as input, and outputs vision-language tokens $X_{t}$. The robot actions $a_{t}$ are generated through the policy head using the given $X_{t}$ (li2023vision, ).

Fig. 3. RoboFlamingo policy head. The vision-language token is from the LLM inference. The outputs are seven-dimensional variables that represent the movements of the robot in the next time step.

We elaborate on the action generation process in Fig. 3. At each time step $t$, the policy head takes the vision-language tokens $X_{t}$ generated by the LLM as input and passes them through an LSTM network. The hidden state $h_{t}$ is then mapped to the 7-DoF action through two MLP heads, as shown in Equ. 2:

(2) $a_{t}^{\text{pose}}, a_{t}^{\text{gripper}} = \operatorname{MLP}(h_{t}).$

The training loss thus contains two parts, as illustrated in Equ. 3. The pose estimation is supervised using mean squared error (MSE) loss, while the gripper status is supervised using binary cross-entropy (BCE) loss. The weight $\lambda$ is used to balance the two parts.

(3) $\ell = \sum_{t} \operatorname{MSE}(a_{t}^{\text{pose}}, \hat{a}_{t}^{\text{pose}}) + \lambda \operatorname{BCE}(a_{t}^{\text{gripper}}, \hat{a}_{t}^{\text{gripper}})$
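As a rough illustration of Equ. 3, the following NumPy sketch sums a per-step MSE on the six pose values and a weighted BCE on the gripper probability. The array shapes, the default value of `lam`, and the helper names are assumptions made for this example; they are not taken from the RoboFlamingo implementation.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over the pose dimensions of one time step."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy between a gripper probability and a 0/1 label."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def policy_loss(pose_pred, pose_gt, grip_prob, grip_gt, lam=0.01):
    """Equ. 3: sum over time steps of MSE(pose) + lam * BCE(gripper).

    pose_pred, pose_gt: arrays of shape (T, 6)
    grip_prob, grip_gt: arrays of shape (T,)
    lam: balancing weight (the default here is a placeholder, not the paper's value)
    """
    total = 0.0
    for t in range(pose_pred.shape[0]):
        total += mse(pose_pred[t], pose_gt[t]) + lam * bce(grip_prob[t], grip_gt[t])
    return total
```

In practice this would be written with an automatic-differentiation framework so that gradients flow back to the policy head; plain NumPy is used here only to keep the arithmetic visible.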

During inference, the policy head maintains a queue of length 12. If the queue is not full, the policy head predicts the action $a_{t}^{\text{pose}}$, $a_{t}^{\text{gripper}}$ and updates the hidden state $h_{t}$ for the next-step prediction. Once the queue reaches its maximum capacity, the earliest tokens that entered the queue are replaced by the latest ones. Then, consistent with the training process, the action of the current step $a_{t}$ is given based on the last 12 vision-language tokens $X_{t-11} \sim X_{t}$.
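The queue behaviour described above amounts to a fixed-length sliding window. A minimal sketch using Python's `collections.deque` is shown below; the class name `TokenWindow` and the string placeholders standing in for vision-language tokens are hypothetical.

```python
from collections import deque

WINDOW = 12  # queue length stated in the text


class TokenWindow:
    """Fixed-length queue: once full, the oldest token is dropped on each push."""

    def __init__(self, maxlen=WINDOW):
        self._tokens = deque(maxlen=maxlen)

    def push(self, token):
        # deque with maxlen discards the leftmost (oldest) item automatically
        self._tokens.append(token)

    def is_full(self):
        return len(self._tokens) == self._tokens.maxlen

    def contents(self):
        return list(self._tokens)


window = TokenWindow()
for t in range(15):
    window.push(f"X_{t}")  # placeholder for the token produced at step t
```

After the fifteen pushes in this example, the window holds the tokens for steps 3 through 14, that is, the last 12 steps.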

3.2. Basic Corki Algorithm

We believe the fundamental design principle of current embodied AI algorithms is to better supervise the output of every frame. However, frame-by-frame supervision violates the philosophy of robotic systems. We thus propose predicting a trajectory instead, describe the corresponding training modifications, and further improve our design through adaptive trajectory length selection at runtime.

Trajectory Prediction.

We predict the continuous trajectory of the near future instead of discrete actions. We use a cubic function to fit the motion trajectory of robotic arms. Of the seven variables we need to predict, we output a trajectory for each of the first six ($r_{x}(t), r_{y}(t), r_{z}(t), r_{\alpha}(t), r_{\beta}(t), r_{\gamma}(t)$); the gripper $g$ is still a binary value. Using $r_{x}(t)$ as an example, the model output takes the form of Equ. 4, where $t$ is time.

(4) $r_{x}(t) = at^{3} + bt^{2} + ct + d$
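The sketch below evaluates cubic trajectories of the form in Equ. 4 for all six pose variables at once. The layout of `coeffs` (one row of $(a, b, c, d)$ per variable) and the function name are assumptions chosen for this illustration.

```python
import numpy as np


def eval_trajectory(coeffs, t):
    """Evaluate r(t) = a t^3 + b t^2 + c t + d for each pose variable.

    coeffs: array of shape (6, 4); row order x, y, z, alpha, beta, gamma,
            each row holding (a, b, c, d). This layout is an assumption.
    t:      scalar or 1-D sequence of query times.
    Returns an array of shape (6, len(t)).
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    powers = np.stack([t**3, t**2, t, np.ones_like(t)])  # shape (4, n)
    return np.asarray(coeffs, dtype=float) @ powers
```

Because the trajectory is a closed-form polynomial, the controller can sample it at whatever rate it needs, independently of how often the LLM produces new coefficients.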

Loss Design.

After we change the model output, there are two ways of designing the loss. The first is to directly supervise $a, b, c, d$. The second is to supervise the trajectory with the ground truth. We choose the second for two reasons. First, datasets usually do not provide ground truth for $a, b, c, d$, which means we would need to extract it from the trajectory ground truth first, which accumulates errors. Second, these parameters vary significantly and are not conducive to the learning of the neural network.

Take the variable $r_{x}(t)$ as an example. We supervise our predicted trajectory $r_{x}$ against the trajectory action $T_{x}$ in the training set using the MSE shown in Equ. 5. Then, we update our trajectory parameters through backpropagation. In this way, we no longer need to first obtain discrete actions at 30 Hz and can directly supervise the trajectory to obtain a more capable model.

(5) $\ell_{x} = \sum_{j=0}^{k} \operatorname{MSE}(r_{x}(j), T_{x}(j))$
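A minimal sketch of Equ. 5 for a single variable follows. It samples the predicted cubic at the step indices $j = 0, \ldots, k$ and accumulates the squared error against the ground-truth samples. Treating $j$ as an integer step index and the per-step MSE of a scalar as a squared error are assumptions of this example.

```python
import numpy as np


def trajectory_loss(coeffs, target):
    """Equ. 5 for one variable: sum over j = 0..k of (r_x(j) - T_x(j))^2.

    coeffs: (a, b, c, d) of the predicted cubic r_x
    target: ground-truth samples T_x(0), ..., T_x(k)
    """
    a, b, c, d = coeffs
    target = np.asarray(target, dtype=float)
    j = np.arange(len(target), dtype=float)      # step indices 0..k
    pred = a * j**3 + b * j**2 + c * j + d       # r_x(j)
    return float(np.sum((pred - target) ** 2))
```

The loss is defined on trajectory samples rather than on the coefficients themselves, which matches the choice described above of supervising the trajectory instead of $a, b, c, d$.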

Because of our design, robotic control and vision inputs are decoupled, so the robot captures less information. To mimic this process during training, we intentionally insert fewer images. As shown in Fig. 4, vision-language tokens from $t=2$ to $t=4$ are replaced by a mask embedding, similar to existing works such as BERT (devlin2019bert, ).
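The masking step can be pictured as overwriting selected rows of the token sequence with one shared embedding. The sketch below is only an illustration: the token shape, the zero-based reading of the step indices, and the name `mask_tokens` are assumptions, and in a real model the mask embedding would be a learned parameter.

```python
import numpy as np


def mask_tokens(tokens, masked_steps, mask_embedding):
    """Return a copy of `tokens` with the given time steps replaced by a mask.

    tokens:         array of shape (T, D), one token vector per time step
    masked_steps:   iterable of step indices to replace
    mask_embedding: array of shape (D,), shared across all masked steps
    """
    out = np.array(tokens, dtype=float, copy=True)
    out[list(masked_steps)] = mask_embedding  # broadcast over the selected rows
    return out


# Example: mask steps 2..4 of a six-step sequence, as in the description above.
tokens = np.arange(12, dtype=float).reshape(6, 2)
masked = mask_tokens(tokens, range(2, 5), np.full(2, -1.0))
```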

3.3. Optimizing Corki Algorithm

In the basic Corki algorithm, the length of the trajectory is fixed all the time. Suppose the prediction interval is set to be 16.5 ms. No shorter or longer trajectory can be taken. However, one of the most significant characteristics of robotic applications is that they usually encounter sudden environmental changes.

Early Termination.

We thus provide flexibility in the length of the trajectory that is taken. The prediction length is used as an upper bound on the length of the trajectory actually taken, and early termination is allowed. Again, assuming the prediction interval is 16.5 ms, the actual trajectory can be from 3.3 ms to 16.5 ms, with a stride of $\Delta t$ (which is 3.3 ms assuming the camera sensor works in a 30 Hz frequency). After the robot terminates early, the model predicts another trajectory for the next 16.5 ms.

Early termination gives us some flexibility, but it may not be enough. The reason is that the accuracy is higher when the actual trajectory length is consistent between training and inference. If the actual trajectory length is 6.6 ms in training, the same length should be taken during inference. Suppose the user wants to change the actual trajectory length. In that case, the only way is to train two models, one for 6.6 ms and one for 9.9 ms, and switch during inference, which is unsurprisingly inconvenient for almost all robotic applications.

Fig. 4. Masked policy head. The tokens in the dotted line are not generated through an LLM but are instead a mask embedding.

Adaptive Trajectory Length.

Our method is to increase flexibility by allowing adaptive trajectory length with an empirical method. Our insight comes from the curvature of the trajectory. When the curvature is low, the action does not change significantly, suggesting a longer trajectory is acceptable. However, when the curvature is high, the usual circumstance is that the robot is encountering sudden change, where a shorter trajectory is better.

Waypoints Extraction.

We identify the adaptive trajectory length using a concept called waypoints. For example, for a given trajectory spanning 16.5 ms, a waypoint is defined as a point on the trajectory every 3.3 ms, that is, at each time step. In Fig. 5, point $A$ is the starting point, points $B$ to $F$ are the waypoints, and point $F$ is the endpoint. Waypoint identification aims to find a waypoint where the robot's movement is significant. In our case, significant movements are identified as high curvature or changes in the gripper state.

Waypoints Identification.

We check each waypoint from $B$ to $F$ and compare two metrics to identify potential waypoints with high curvature. In the example in Fig. 5, the waypoint currently being checked is $D$. For every point in the interval $[B, D)$, we compare two metrics against their corresponding thresholds. Taking $B$ as the example point, the first metric is the pair of angles $\angle BAD$ and $\angle BDA$, with a threshold of 90 degrees. The second is the distance from point $B$ to line $AD$, denoted $d(B, AD)$, with a threshold $d$. If any threshold is violated, we consider the curvature at a point between $C$ and $D$ to be high, and thus the trajectory should end at $D$ instead of the predicted point $F$. The length of the trajectory depends on the endpoint we get.

To find potential waypoints with gripper state changes, we compare the state of the gripper at the current waypoint and the next waypoint. If the gripper states of these two waypoints are different, the current waypoint will be identified as one with significant movement.

Fig. 5. Waypoints extraction and identification algorithm. The first waypoint with significant movement will be identified and taken as the endpoint of the trajectory.
Algorithm 1 Adaptive trajectory length.
Input: Starting point $A$, trajectory $T$, gripper states $G = 0,0,0,1,0$
Output: The earliest termination point
  // Extracting waypoints at each time step.
  $B, C, D, E, F = E(A, T)$
  for $P$ in the range $[B, F]$ do
    $P_{n} \leftarrow$ the next waypoint of $P$
    if $G(P)$ or $G(P_{n}) = 1$ then
      return $P$
    end if
    for $p$ in the range $(A, P]$ do
      if $\angle(p, AP)$ or $\angle(p, PA) > \frac{\pi}{2}$, or $D(p, AP) > d$ then
        return $P$
      end if
    end for
  end for
  return $F$

We explain the process in Algo. 1. Because the adaptive trajectory length is determined at runtime, the algorithm is latency-sensitive. The algorithm we propose is effective and has low latency: in most cases, the total computational cost of Algo. 1 is less than 500 FLOPs.
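One possible reading of Algo. 1 is sketched below in Python. It is an interpretation rather than the authors' implementation: it follows the prose in treating a gripper change as a difference between the states at the current and next waypoint, it checks the intermediate points strictly before the candidate endpoint, it returns an index into the waypoint list instead of the point itself, and it assumes all points are distinct. The function names and the `d_max` parameter are hypothetical.

```python
import numpy as np


def _angle(u, v):
    """Angle in radians between vectors u and v (both assumed non-zero)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def _dist_to_line(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    ab, ap = b - a, p - a
    proj = np.dot(ap, ab) / np.dot(ab, ab) * ab
    return float(np.linalg.norm(ap - proj))


def termination_index(start, waypoints, gripper, d_max):
    """Index of the earliest waypoint at which the trajectory should end.

    start:     starting point A
    waypoints: list of points B..F, one per time step
    gripper:   gripper state (0 or 1) at each waypoint
    d_max:     distance threshold d
    """
    a = np.asarray(start, dtype=float)
    pts = [np.asarray(w, dtype=float) for w in waypoints]
    for i, end in enumerate(pts):
        # Gripper check: stop before the gripper state changes.
        if i + 1 < len(pts) and gripper[i] != gripper[i + 1]:
            return i
        # Curvature check on every intermediate waypoint before `end`.
        for q in pts[:i]:
            if (_angle(q - a, end - a) > np.pi / 2
                    or _angle(q - end, a - end) > np.pi / 2
                    or _dist_to_line(q, a, end) > d_max):
                return i
    return len(pts) - 1  # no significant movement found: keep the full trajectory
```

For five waypoints the nested loops perform at most ten intermediate checks, each a handful of dot products, which keeps the work small in line with the low cost reported above.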

We provide users with an algorithm framework. Users can decide the length of trajectory prediction, whether early termination is needed, the level of early termination, and whether adaptive trajectory length is needed.

3.4. Closed-loop Control

Till now, Corki is performing open-loop control. The algorithm will produce a trajectory with various lengths, and until the following inference happens, there is no feedback information for the robots. However, open-loop control can lead to lower performance as it easily accumulates errors.

We modify the open-loop feature. During the execution of each trajectory, we randomly send images back before the endpoint of the trajectory. These images are encoded using a Vision Transformer (ViT) (dosovitskiy2020image, ). The post-encoding features and the tokens generated through the LLM are concatenated and used to predict the subsequent trajectory.

4. Corki Hardware and System Design

This section introduces the Corki hardware. We accelerate the control process to achieve real-time performance. The input of the control module is the trajectory predicted by the Corki algorithm, and the output is the torque signals applied to the motors in each joint of the robot. We call the control framework task space computed torque control (TS-CTC). We first elaborate on the control framework (Sec. 4.1), then analyze the bottleneck and propose the Corki accelerator (Sec. 4.2). We further propose an effective approximation strategy to improve the control frequency (Sec. 4.3). We finally describe the system pipeline (Sec. 4.4).

4.1. Task Space Computed Torque Control

Workflow.

The task space computed torque control (TS-CTC) method is widely used in robotics for precise manipulation tasks due to its ability to handle reference inputs in the task space (murray2017mathematical, ). We show the control framework in Fig. 6.

(6) $\tau = J^{T}(\theta)\left[M_{x}(\theta)\left(\ddot{x}_{d} + K_{p}e + K_{v}\dot{e}\right) + h_{x}(\theta,\dot{\theta})\right]$
    $e = x_{d} - x, \quad \dot{e} = \dot{x}_{d} - \dot{x}$
Fig. 6. Task space computed torque control.

The input of TS-CTC has two parts. The first part is the reference trajectory $x_d$ together with its first-order derivative $\dot{x}_d$ (velocity) and its second-order derivative $\ddot{x}_d$ (acceleration). The second part is the joint angles $\theta$ and joint angular velocities $\dot{\theta}$ of the robot arm from the sensors. The output is the joint torque $\tau$. We describe the control process in Eq. 6. To achieve smooth robot control, the frequency of generating torques should be at least 100 Hz (yang2023dadu, ; dantec2021whole, ).

Key Computing Blocks.

TS-CTC contains five key computing blocks, which are the most computationally intensive part of the whole process. We show them as red blocks in Fig. 6. The forward kinematics block calculates the pose $x$ of the end-effector in the task space based on the joint angles $\theta$. The Jacobian block calculates the Jacobian matrix $J(\theta)$ and the velocity $\dot{x}$ of the end-effector in the task space based on the joint angles $\theta$ and velocities $\dot{\theta}$. The task space mass matrix block computes the inertial matrix $M_x(\theta)$ of the robot arm in the task space based on the joint angles $\theta$. The task space bias force block computes the bias force $h_x(\theta,\dot{\theta})$ applied to the robot arm in the task space based on the joint angles $\theta$ and velocities $\dot{\theta}$.
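To make the data flow of Eq. 6 concrete, the following minimal sketch evaluates the control law once the outputs of the computing blocks are available. It assumes that the Jacobian, the task space mass matrix, and the bias force are already computed and passed in as nested lists, and that $K_p$ and $K_v$ are proportional and derivative gain matrices, as is standard in computed torque control. The function names are hypothetical and the sketch does not model any particular robot.

```python
def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def ts_ctc_torque(J, Mx, hx, x_d, xd_d, xdd_d, x, xd, Kp, Kv):
    """One evaluation of Eq. 6: tau = J^T [ Mx (xdd_d + Kp e + Kv e_dot) + hx ]."""
    e = [a - b for a, b in zip(x_d, x)]          # e = x_d - x
    e_dot = [a - b for a, b in zip(xd_d, xd)]    # e_dot = xd_d - xd
    a_ref = [acc + p + v
             for acc, p, v in zip(xdd_d, matvec(Kp, e), matvec(Kv, e_dot))]
    force = [m + h for m, h in zip(matvec(Mx, a_ref), hx)]
    return matvec(transpose(J), force)

# Toy 2D example with placeholder matrices (not taken from a real arm).
identity = [[1.0, 0.0], [0.0, 1.0]]
tau = ts_ctc_torque(
    J=[[1.0, 2.0], [0.0, 1.0]], Mx=identity, hx=[0.0, 0.0],
    x_d=[0.1, 0.0], xd_d=[0.0, 0.0], xdd_d=[0.0, 0.0],
    x=[0.0, 0.0], xd=[0.0, 0.0],
    Kp=[[10.0, 0.0], [0.0, 10.0]], Kv=identity,
)
print(tau)
```

The final step is a handful of small matrix-vector products, so the cost of a control cycle is dominated by producing $J(\theta)$, $M_x(\theta)$, and $h_x(\theta,\dot{\theta})$, which is consistent with those blocks being the acceleration target.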

4.2. Corki Hardware

Fig. 7. The results of the key blocks are reusable, as indicated by the arrows. Later stages consume partial results from early stages.

Bottleneck Characterization.

We analyze the compute patterns of the above control algorithms and identify two key characteristics. First, as shown in Fig. 7, a significant amount of intermediate data is reusable. For instance, the calculation of the Jacobian matrix reuses results from forward kinematics. Similarly, the computation of the mass matrix and bias force reuses results from the Jacobian matrix and its transpose. Second, all blocks primarily consist of four basic operations: computing the pose of each link, the velocity of each link, the acceleration of each link, and the force of each link. Due to physical laws (e.g., acceleration is the derivative of velocity), these operations follow fixed data dependencies. For example, the velocity operator consumes a six-dimensional vector from the pose operator to calculate a six-dimensional vector representing velocity. A similar trend exists between the acceleration and force operators.

Fig. 8. Hardware architecture for efficiently solving TS-CTC. Blocks in blue are in the format of a dataflow accelerator, and blocks in yellow are customized circuits.

Hardware Architecture.

Leveraging the above analysis, our hardware design has two main goals. First, we aim to customize circuits and data pipelines to maximize intermediate data reuse, achieving high parallelization and performance. Second, we focus on customizing on-chip SRAM design to enable single read and write operations during computation, eliminating extra memory accesses.

Fig. 8 shows the Corki architecture, which consists of two parts. The blue blocks form a dataflow accelerator, where all main operators are connected through three FIFOs and a line buffer (LB). This design enables extreme pipelining; for example, the velocity calculation of the first link can start while the pose calculation of the second link begins. The yellow blocks are customized circuits, with the task space mass matrix unit reusing data from the pose unit and the task space bias force unit reusing data from both the velocity unit and the torque unit. There are occasional stalls in the accelerators due to differing latencies between the dataflow accelerator and the customized circuits. A simple micro-controller manages the control flow of the accelerator.
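The benefit of this link-level pipelining can be illustrated with a simple cycle-count model. It is an idealized sketch that assumes every stage takes the same number of cycles per link and that no stalls occur, which the real accelerator does not guarantee. The stage count of four stands for the pose, velocity, acceleration, and force operators, and the function names are hypothetical.

```python
def sequential_cycles(num_links, num_stages, stage_cycles=1):
    """Every link passes through every stage before the next link starts."""
    return num_links * num_stages * stage_cycles

def pipelined_cycles(num_links, num_stages, stage_cycles=1):
    """A new link enters the first stage as soon as the previous link leaves it."""
    return (num_links + num_stages - 1) * stage_cycles

links, stages = 7, 4  # 7-DoF arm; pose, velocity, acceleration, force
print(sequential_cycles(links, stages), pipelined_cycles(links, stages))
```

Under these assumptions a 7-link arm needs 28 stage slots when executed sequentially and 10 when pipelined.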

Our on-chip buffer design is highly effective. In the first four stages of the dataflow accelerator, three FIFOs store intermediate data, as the producer and consumer rates are identical. A line buffer between the force unit and the torque unit captures the rate mismatch between them. The remaining intermediate data is stored in a small scratchpad memory. This combination of different on-chip buffer designs allows for minimal on-chip SRAM consumption while ensuring no data communication with off-chip DRAM during execution.

4.3. Application-specific Approximate Computing

Opportunity.

We observe that robotic control has a unique feature: the compute frequency is high, yet the change in each control signal is small. For a 7-DoF robot arm, each joint moves only slightly between consecutive control cycles. However, the computation of control signals is organized per joint, as illustrated in the previous section. A joint-based approximation is therefore possible to further save computation and reduce latency.

Quantitative Analysis.

To quantitatively demonstrate our observation, we perform an experiment. We use a 7-DoF Franka Emika Panda robot arm (gaz2019dynamic, ) and monitor the item-wise changes in the mass matrix while slightly adjusting each joint by an angle. For example, we first record all the items in the mass matrix, then change the first joint by 0.1 radians (approximately 6 degrees), 0.3 radians (approximately 17 degrees), and 0.5 radians (approximately 29 degrees), monitoring the changes in the mass matrix. We repeat the same experiments for all the joints in the robot arm.

We show the results in Fig. 9. The results indicate that when motion occurs in joints 1 and 7, the mass matrix remains nearly constant. This phenomenon is illustrated in the top right and bottom right figures in Fig. 10. Movements in the end joints (joint 1 and joint 7) have minimal impact on the morphology of the robot arm, leading to less significant changes in the mass matrix. Similarly, for joints 5 and 6, the maximum variation in matrix elements does not exceed 0.1 even with an angular change of 29 degrees.

However, for the joints in the middle of the robot arm, the situation is different. When joint 2 moves, even a change of 6 degrees results in a maximum absolute change in matrix elements of 0.17 (with a maximum relative change of approximately 15.4%). When the motion increases to 29 degrees, the maximum relative change in elements can be as high as 45.2%. The bottom left figure in Fig. 10 shows that when the middle joints undergo movement, the morphology of the robot arm is significantly changed, necessitating the re-computation of all parameters in the control process.
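The same kind of measurement can be reproduced on a toy model. The sketch below uses the textbook joint-space mass matrix of a planar two-link arm, not the Franka Emika Panda, and all physical parameters are placeholder values. It only illustrates the qualitative effect: rotating the base joint leaves the matrix unchanged, while rotating the second joint changes it.

```python
import math

def planar_2link_mass_matrix(q1, q2, m1=1.0, m2=1.0, l1=1.0,
                             lc1=0.5, lc2=0.5, I1=0.1, I2=0.1):
    """Joint-space mass matrix of a planar 2-link arm (placeholder parameters).

    q1 is accepted for symmetry but does not appear in the formula: the matrix
    of this model depends on q2 only.
    """
    a = I1 + I2 + m1 * lc1 ** 2 + m2 * (l1 ** 2 + lc2 ** 2)
    b = m2 * l1 * lc2
    c = I2 + m2 * lc2 ** 2
    cos2 = math.cos(q2)
    return [[a + 2 * b * cos2, c + b * cos2],
            [c + b * cos2, c]]

def max_abs_change(M0, M1):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for r0, r1 in zip(M0, M1) for x, y in zip(r0, r1))

q1, q2 = 0.3, 0.8
base = planar_2link_mass_matrix(q1, q2)
for delta in (0.1, 0.3, 0.5):  # radians, as in the experiment above
    d1 = max_abs_change(base, planar_2link_mass_matrix(q1 + delta, q2))
    d2 = max_abs_change(base, planar_2link_mass_matrix(q1, q2 + delta))
    print(f"delta={delta:.1f} rad  joint1: {d1:.3f}  joint2: {d2:.3f}")
```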

Fig. 9. The maximum difference in the elements of the mass matrix before and after movements in the joint. The experiments are conducted on a Franka Emika Panda robot arm. The movement consists of rotation with angles of 6 degrees, 17 degrees and 29 degrees on all 7 joints.
Fig. 10. The morphology of the Franka Emika Panda robot arm in different configurations. We change joint 1, joint 2 and joint 5 by 29 degrees and show the difference.

Approximate Computation.

We design a simple yet effective approximate computing method to dynamically update the control parameters, reducing the computational cost of the control process. Specifically, given the input $\theta$, we first compute the probability that each matrix (e.g., Jacobian matrix, mass matrix, etc.) needs an update, based on an impact factor derived from the angular movement of each joint. In this process, the joints with a small impact on parameter changes have smaller impact factors, while the joints with a large impact on parameter changes have larger impact factors. The probability computation consumes less than 100 FLOPs, which does not affect the final latency.

If the probability of updating a matrix exceeds a certain threshold, the corresponding computation to generate that matrix is performed. Otherwise, the corresponding elements from the previous control cycle are reused. We observe that over 51% of matrix updates can be avoided without any loss in control accuracy.
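A minimal sketch of this update-or-reuse policy is shown below. The text does not give the exact formula for the update probability, so the sketch substitutes a weighted sum of absolute joint movements as an update score. The class name, the weights, and the threshold are hypothetical, and measuring movement against the angles of the last recomputation (rather than the previous cycle) is a design choice of this sketch to avoid drift.

```python
import math

class CachedMatrix:
    """Recompute a matrix only when the weighted joint movement is large enough."""

    def __init__(self, compute_fn, impact, threshold):
        self.compute_fn = compute_fn  # expensive function of the joint angles
        self.impact = impact          # per-joint impact factors (assumed weights)
        self.threshold = threshold    # update threshold (assumed value)
        self.theta_ref = None         # joint angles at the last recomputation
        self.value = None
        self.recomputes = 0

    def _score(self, theta):
        return sum(w * abs(t - r)
                   for w, t, r in zip(self.impact, theta, self.theta_ref))

    def get(self, theta):
        if self.value is None or self._score(theta) > self.threshold:
            self.value = self.compute_fn(theta)
            self.theta_ref = list(theta)
            self.recomputes += 1
        return self.value

# Toy example: joint 0 has no impact on the cached quantity, joint 1 does.
cache = CachedMatrix(lambda th: [[math.cos(th[1])]], impact=[0.0, 1.0], threshold=0.05)
cache.get([0.0, 0.0])   # first call always computes
cache.get([0.5, 0.0])   # large move on the low-impact joint: reused
cache.get([0.5, 0.1])   # move on the high-impact joint: recomputed
print(cache.recomputes)
```

The score costs one multiply-add per joint, which keeps the decision far cheaper than regenerating any of the matrices.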

4.4. System Pipeline

The system we propose has three key components: LLM inference, robotic control, and data communication. Network inference on the server predicts the trajectory. The parameters of the trajectory are sent to the controller, which is located on the robot. The controller calculates the actual high-frequency control signals that make the robot follow the planned trajectory. While the robot moves, images are captured by the camera mounted on the robot at random time steps before the trajectory ends. These images are sent back to the server while the robot continues to finish the rest of the trajectory, so communication and robotic control execute in parallel. When the robot reaches the end of the trajectory, it captures another image and sends it back to the server. A new trajectory is then predicted through LLM inference using this image and the previous images.

5. Experimental Methodology

This section describes our evaluation methodology. First, we will discuss the experimental setup, including the software, dataset, and hardware (Sec. 5.1). Then, we will cover the baselines we compare and the variations of Corki (Sec. 5.2).

5.1. Experimental Setup

We build Corki on the foundation of RoboFlamingo, but our work is extensible to other action-prediction-based embodied AI robots. We implement the algorithm innovation in PyTorch (paszke2019pytorch, ), where the network output predicts a trajectory instead of a discrete action. This predicted trajectory is then fed back into the simulation environments to test the robot’s task completion capabilities.

We use the Calvin (mees2022calvin, ) dataset, one of the most widely used embodied AI datasets, together with its software simulation environments. Calvin includes 34 different tasks with 22994 demonstrations for training and 1000 sequences for testing. We evaluate our algorithm in two different scenarios: seen scenarios, where the tasks in the testing set are similar but not identical to those encountered during training, and unseen scenarios, which are more challenging as the tasks are entirely new and have not been encountered during training.

Tasks and Metrics.

The tasks are categorized into five types: moving an object, turning a switch on and off, pushing and pulling a drawer, rotating an object, and lifting an object. We use two metrics to evaluate the algorithm’s accuracy: success rate and average job length. The success rate is the most straightforward metric for quantifying a single task, calculated as the number of successful sequences divided by the total sequences. Given that the embodied AI algorithms are designed to improve robots’ abilities on long-horizon jobs, we further report the accuracy on finishing a job. Each job contains five consecutive tasks. The average job length measures how many tasks the robot can complete within a job, with a maximum of 5.
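The two metrics are related: the average job length equals the sum, over k = 1 to 5, of the fraction of jobs in which at least k consecutive tasks are completed. The sketch below computes both from a list of per-job completion counts (the input list is made-up illustration data) and then checks the identity against the RoboFlamingo row of Table 1. The function name is hypothetical.

```python
def job_metrics(tasks_completed, max_tasks=5):
    """Return (success rates for completing >= k tasks, average job length)."""
    n = len(tasks_completed)
    rates = [sum(1 for c in tasks_completed if c >= k) / n
             for k in range(1, max_tasks + 1)]
    return rates, sum(rates)

# Made-up example: four jobs that completed 5, 3, 0 and 1 consecutive tasks.
rates, avg_len = job_metrics([5, 3, 0, 1])
print(rates, avg_len)

# Summing the RoboFlamingo success rates on seen tasks reproduces its Avg Len.
roboflamingo_seen = [0.895, 0.719, 0.556, 0.434, 0.312]
print(round(sum(roboflamingo_seen), 3))  # 2.916
```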

Trajectory Comparison.

We further utilize two different metrics to illustrate why the results we predict are better:

  • Mean trajectory error. We compare the geometric distance between the predicted trajectory and the ground truth, using root mean square error (RMSE) as the metric. Generally, a smaller RMSE indicates better trajectory prediction.

  • Maximum trajectory distance. We also compare the maximum distance between the predicted and ground truth trajectories. A larger maximum distance denotes a higher likelihood of failure.
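Both trajectory metrics can be computed from point-wise distances, as in the sketch below. It assumes that the predicted and ground-truth trajectories are sampled at the same time steps and that distance means Euclidean distance between corresponding 3D points; the exact definition used in the evaluation may differ. The function name and the sample points are illustrative.

```python
import math

def trajectory_metrics(pred, gt):
    """Return (RMSE, maximum distance) between two equally sampled trajectories."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    rmse = math.sqrt(sum(d * d for d in dists) / len(dists))
    return rmse, max(dists)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
rmse, max_dist = trajectory_metrics(pred, gt)
print(rmse, max_dist)
```

The example shows why both metrics are reported: a single large deviation dominates the maximum distance but is averaged down in the RMSE.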

Hardware.

We evaluate the inference latency and energy consumption using an NVIDIA V100 GPU, measuring power with NVML (vonnvml, ). Control latency and power are measured using an Intel 13th generation i7-13700 CPU. We implement the Corki hardware on a Xilinx Zynq-7000 SoC ZC706 FPGA (zc706, ) to assess real hardware performance. Additionally, we establish Wi-Fi communication between a 7-DoF Franka Emika Panda robot arm (gaz2019dynamic, ) and our server to measure communication latency.

5.2. Baselines and Variations

Baselines.

We train RoboFlamingo using the Calvin dataset for accuracy comparison. The results are either higher than or equivalent to the originally reported ones. For latency and energy consumption comparisons, we establish a baseline using the traditional execution pipeline of embodied AI algorithms, where the inference latency, control latency, and communication latency are accumulated in each frame.

Variations.

As discussed earlier, Corki can predict the trajectory of the next $N$ steps, with each step taking approximately 3.3 ms. Given the predicted trajectory covering $N$ steps, the robot can take anywhere from 1 to $N$ steps. Taking more steps reduces the inference frequency but may also lead to lower accuracy. In our evaluation, we predict nine steps each time and vary the steps taken from 1 to 9 with a stride of 2, creating five variations named Corki-T, where T represents the actual steps taken.

In addition to the fixed-step variations, we evaluate the adaptive option discussed in Section 3.3. We name this variation Corki-ADAP. In Corki-ADAP, the number of steps the robot takes is selected by the waypoint identification algorithm and is smaller than $N$.

6. Evaluation

We evaluate Corki in this section. We first show that the Corki accelerator has low hardware resource consumption (Sec. 6.1). We then evaluate the accuracy of Corki (Sec. 6.2) and the corresponding latency and energy savings (Sec. 6.3).

6.1. Hardware Resource Consumption

The Corki accelerator is compact and does not require significant hardware resources, making it feasible for deployment on a real robot. It consumes only 13.6% of digital signal processors (DSP), 7.8% of flip-flops (FF), and 16.9% of look-up tables (LUT). The specialized on-chip buffer design is effective; the Corki accelerator utilizes only 6.6% of the total block random access memory (BRAM), with no data communication with off-chip DRAM during each control process.

6.2. Accuracy

Table 1. Accuracy on seen tasks. Baseline is retrained. Columns 1 to 5 give the success rate of completing that many consecutive tasks in a sequence.
Variation       1       2       3       4       5       Avg Len
RoboFlamingo    89.5%   71.9%   55.6%   43.4%   31.2%   2.916
Corki-1         89.1%   75.3%   59.2%   47.1%   37.1%   3.078
Corki-3         89.4%   75.7%   62.6%   52.9%   42.8%   3.234
Corki-5         92.3%   80.0%   67.4%   56.6%   45.8%   3.421
Corki-7         89.1%   73.8%   59.5%   48.7%   38.1%   3.092
Corki-9         88.0%   72.0%   56.4%   46.3%   35.6%   2.983
Corki-ADAP      93.5%   77.7%   61.4%   49.1%   38.3%   3.2

Table 2. Accuracy on unseen tasks. Baseline is retrained. Columns 1 to 5 give the success rate of completing that many consecutive tasks in a sequence.
Variation       1       2       3       4       5       Avg Len
RoboFlamingo    82.4%   61.9%   46.6%   33.1%   23.5%   2.48
Corki-1         86.0%   68.0%   52.6%   40.3%   30.0%   2.769
Corki-3         83.2%   65.6%   50.7%   37.2%   27.5%   2.642
Corki-5         85.9%   68.4%   54.3%   42.2%   31.6%   2.824
Corki-7         83.8%   65.5%   50.5%   40.6%   31.9%   2.723
Corki-9         79.4%   59.5%   44.0%   33.7%   24.7%   2.413
Corki-ADAP      85.7%   69.4%   54.1%   41.9%   31.6%   2.827

Fig. 11. Trajectory comparison between Corki and RoboFlamingo with two quantitative metrics: (a) mean trajectory error; (b) maximum trajectory distance.
Fig. 12. Trajectory comparison of a randomly picked sequence from the test set in the (a) X, (b) Y, and (c) Z dimensions. The trajectories of Corki follow the ground truth, while the trajectories of RoboFlamingo are off the target. We only show Corki-5 for simplicity.

Success Rate and Average Job Length.

We show accuracy results on seen scenarios and unseen scenarios in Tbl. 1 and Tbl. 2, respectively. Almost all variations of Corki outperform the baseline in terms of both success rate and average job length, except for Corki-9 in unseen scenarios. On average, in seen scenarios Corki improves the success rate by 8.6% and the average job length by 0.3. In unseen scenarios, these improvements are 8.1% and 0.2, respectively.

Among all fixed-step variations of Corki, Corki-5 achieves the highest accuracy and significantly outperforms the baseline. On seen tasks, it improves the average job length by 17.3% compared to the baseline, with a gain of 0.5 in job length. The trend observed among all Corki variations is that accuracy improves as the length of the actual trajectory taken increases. However, after reaching its peak accuracy, there is a gradual degradation in performance when the length of the actual trajectory taken continues to increase.

Corki-ADAP selects the length of the actual trajectory through identifying waypoints with significant movements. We observe that the results of Corki-ADAP fall between those of Corki-7 and Corki-5 in seen tasks, and it even outperforms Corki-5 in unseen tasks. This demonstrates that determining length during runtime is effective.

Understanding the Results.

The improvement brought by Corki is significant. Corki outperforms the baseline in almost all cases for two reasons. First, a trajectory naturally provides more robot-friendly supervision during algorithm training. When the datasets of embodied AI algorithms are constructed, the ground truth is originally collected in the form of trajectories. In contrast, if discrete actions at a frequency of 30 Hz are used for supervision, the trajectory must first be decomposed into per-frame actions before it is used to train the model. Second, a smooth trajectory with high-frequency control improves the success rate, as demonstrated in the robotics community (kleff2021high, ).

When early termination of Corki is applied, the accuracy trend initially increases and then decreases. This is because the shorter the length of the actual trajectory, the closer it aligns with discrete action supervision. However, if the trajectory taken by the robot is too long, useful environmental information may not be captured and utilized effectively, as the closed-loop feedback also operates at a lower frequency.

Corki-ADAP is effective. This result validates our intuition that predicting a new trajectory whenever a significant movement occurs, such as a high curvature on the trajectory or a change in the status of the gripper, is beneficial.

Trajectory Comparison.

The accuracy of our applications is directly related to the correctness of the trajectory. Therefore, we provide detailed trajectory data for evaluation. We compare the error on the trajectory and show it in Fig. 11. On average, Corki reduces the error by 25.0%.

However, we have also observed that a lower trajectory error does not always correlate with higher accuracy. For instance, although Corki-3 has a lower mean trajectory error compared to Corki-5, its success rate and average job length are lower. This discrepancy arises because the trajectory only reflects the trend of the robotic arm and cannot be treated as a perfect indicator of success rate. Additionally, this statistic does not account for the status of the gripper, which is also critical to the success of tasks.

We further illustrate the differences in trajectories with a real example. We compare trajectories on three dimensions separately and present the results in Fig. 12.

Although the baseline method can generate trajectories close to the ground truth on the Y dimension (Fig. 12b) and Z dimension (Fig. 12c), it clearly deviates from the target on the X dimension at time step 40 (Fig. 12a). In contrast, Corki maintains alignment with the ground truth across all three dimensions. These results again emphasize that while trajectory is related to the success rate, it cannot fully determine task success. Even though Corki’s trajectory slightly differs on the X dimension compared to the ground truth, it still successfully completes the task.

6.3. Performance Comparison

Fig. 13. Runtime latency and energy consumption comparison between Corki and baselines.

Latency Comparison.

We compare the latency and present the results in Fig. 13 on the left y-axis. Corki significantly reduces the frame latency of embodied AI robotic applications. Among the variations, Corki-9 achieves the best speedup of 3.6×, as the inference frequency of the large language model is reduced by 8×. As the length of the actual trajectory taken increases from 1 to 9, the speedup gradually increases from 1.1× to 3.6×. On the other hand, Corki-ADAP demonstrates a speedup of 3.0×, providing an ideal trade-off between accuracy and efficiency.

Energy Consumption Comparison.

Corki also significantly reduces energy consumption. Corki-1 has slightly higher energy consumption than the baseline, as it takes only one step for every predicted trajectory, which is similar to the baseline. Apart from Corki-1, all Corki variations have significantly lower energy consumption. Corki-9 achieves an 8.9× energy reduction. Low energy consumption is critical to robots, which are mostly battery-powered devices.

Fig. 14. Per-frame latency and energy comparison between Corki and RoboFlamingo: (a) per-frame latency breakdown; (b) per-frame energy breakdown.

Frame-by-frame Analysis.

Finally, we show a frame-by-frame analysis of latency and energy consumption for a single sequence. Fig. 14a shows the latency and Fig. 14b shows the energy consumption. The latency and energy consumption of Corki follow the same trend: a crest indicates that LLM inference happens at that time step, and a trough means that the robot is executing the previously predicted trajectory. Corki-5 has periodic crests, as inference happens once every 5 time steps. Corki-ADAP has more irregular crests and troughs than Corki-5, due to waypoint identification and the flexible length of the actual trajectory.

The acceleration comes from three sources. First, the inference frequency is largely reduced, which contributes the most to the latency reduction. Second, the Corki hardware accelerates the control process by up to 29.0×, reducing the control latency. Finally, communication latency between the robot and the server is hidden as we enable pipelining.
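How these three effects combine can be illustrated with a back-of-the-envelope model of average per-frame latency. This is a simplified sketch, not the measurement methodology of the evaluation: it assumes that inference cost is amortized evenly over the steps taken and that communication is fully hidden. The latency values passed in are placeholders chosen for illustration, not measured numbers, and the function names are hypothetical.

```python
def baseline_frame_latency(t_inf, t_ctrl, t_comm):
    """Traditional pipeline: all three latencies accumulate in every frame."""
    return t_inf + t_ctrl + t_comm

def corki_frame_latency(t_inf, t_ctrl, steps, ctrl_speedup=1.0):
    """One inference per `steps` frames, accelerated control, hidden communication."""
    return t_inf / steps + t_ctrl / ctrl_speedup

# Placeholder latencies in milliseconds (illustrative only).
t_inf, t_ctrl, t_comm = 100.0, 10.0, 20.0
base = baseline_frame_latency(t_inf, t_ctrl, t_comm)
for steps in (1, 3, 5, 7, 9):
    ours = corki_frame_latency(t_inf, t_ctrl, steps, ctrl_speedup=10.0)
    print(f"steps={steps}: modeled speedup {base / ours:.2f}x")
```

In this model the speedup grows with the number of steps taken but saturates once the amortized inference term becomes comparable to the control term.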

7. Related Work

Computing Systems for Embodied Artificial Intelligence.

Embodied Artificial Intelligence (EAI) differs from semantic AI by emphasizing agents, typically robots, that interact with the environment and execute long-horizon tasks. Recently, with the success of Large Language Models (LLMs) as planners, research in this domain has intensified, aiming to develop highly intelligent robots (duan2022survey, ; franklin1997autonomous, ; chrisley2003embodied, ; liu2024ok, ; vemprala2024chatgpt, ; huang2023voxposer, ). While most studies focus on enhancing functionalities, our research emphasizes real-time performance. Our approach is rooted in the robotic community, where trajectory serves as the fundamental unit of planning and control. This contrasts with the predominant vision-centric perspective, which treats images or frames as the basic units.

Accelerators for Robotic Applications.

With the growing interest in treating robots as a new computing platform, our community has increasingly focused on dedicated accelerators for robotic computing. These accelerators have been designed for localization (suleiman2019navion, ; gan2021eudoxus, ; liu2021archytas, ; liu2019eslam, ; eyvazpour2023hardware, ; sugiura2022universal, ; liu2022mobilesp, ; sugiura2021unified, ), motion planning (hsiao2023vapr, ; hao2023blitzcrank, ; huang2024moped, ; bakhshalipour2022racod, ; shah2023energy, ; lian2018dadu, ; murray2016microarchitecture, ; murray2019programmable, ), control (neuman2021robomorphic, ; neuman2023roboshape, ; yang2023dadu, ; lian2017dadu, ; sacks2018robox, ; shao2018towards, ; aude1991hardware, ; gac2012fpga, ), and more (mayoral2022robotcore, ; lienen2020reconros, ; hao2024orianna, ; lee2024spade, ; krishnan2022automatic, ; yu2020building, ). However, most accelerators focus on one or multiple modules within a traditional rule-based robotic computing system. Our work, in contrast, focuses on an end-to-end learning-based system, combining innovations in both algorithms and architecture, setting it apart from previous research.

8. Conclusion

Robots equipped with embodied AI algorithms often experience high latency due to the sequential execution pipeline and frequent LLM inference. In this paper, we propose Corki, a software-hardware co-design framework that significantly accelerates this process by transforming the algorithms to predict future trajectories, speeding up the control process, and pipelining communication with control. Results show that Corki achieves up to a 3.6× speedup. Corki also achieves a maximum 17.3% improvement in success rate.

References

  • [5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
  • [6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
  • [7] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
  • [8] Y. Mei, Y.-H. Lu, Y. C. Hu, and C. G. Lee, “Deployment of mobile robots with energy and timing constraints,” IEEE Transactions on robotics, vol. 22, no. 3, pp. 507–522, 2006.
  • [9] O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” The international journal of robotics research, vol. 5, no. 1, pp. 90–98, 1986.
  • [10] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied ai: From simulators to research tasks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022.
  • [11] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “Llm-planner: Few-shot grounded planning for embodied agents with large language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2998–3009.
  • [12] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu et al., “Vision-language foundation models as effective robot imitators,” arXiv preprint arXiv:2311.01378, 2023.
  • [13] N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “Gpt-4v (ision) for robotics: Multimodal task planning from human demonstration,” arXiv preprint arXiv:2311.12015, 2023.
  • [14] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, and Z. Tu, “Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration,” arXiv preprint arXiv:2306.09093, 2023.
  • [15] Y. Zhao, Z. Lin, D. Zhou, Z. Huang, J. Feng, and B. Kang, “Bubogpt: Enabling visual grounding in multi-modal llms,” arXiv preprint arXiv:2307.08581, 2023.
  • [16] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, Y. Qiao, and P. Luo, “Embodiedgpt: Vision-language pre-training via embodied chain of thought,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [17] H. Huang, O. Zheng, D. Wang, J. Yin, Z. Wang, S. Ding, H. Yin, C. Xu, R. Yang, Q. Zheng et al., “Chatgpt for shaping the future of dentistry: the potential of multi-modal large language model,” International Journal of Oral Science, vol. 15, no. 1, p. 29, 2023.
  • [18] H. Everett, Sensors for mobile robots.   CRC Press, 1995.
  • [19] P. Li and X. Liu, “Common sensors in industrial robots: A review,” in Journal of Physics: Conference Series, vol. 1267, no. 1.   IOP Publishing, 2019, p. 012036.
  • [20] G. Santaera, E. Luberto, A. Serio, M. Gabiccini, and A. Bicchi, “Low-cost, fast and accurate reconstruction of robotic and human postures via imu measurements,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 2728–2735.
  • [21] C. Gaz, M. Cognetti, A. Oliva, P. R. Giordano, and A. De Luca, “Dynamic identification of the franka emika panda robot with retrieval of feasible parameters using penalty-based optimization,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4147–4154, 2019.
  • [22] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y. Xie, T. Zhang, Z. Zhao et al., “Toward general-purpose robots via foundation models: A survey and meta-analysis,” arXiv preprint arXiv:2312.08782, 2023.
  • [23] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman et al., “Foundation models in robotics: Applications, challenges, and the future,” arXiv preprint arXiv:2312.07843, 2023.
  • [24] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23716–23736, 2022.
  • [25] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [26] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters (RA-L), vol. 7, no. 3, pp. 7327–7334, 2022.
  • [27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019.
  • [28] Xilinx, “Xilinx zynq-7000 soc zc706 evaluation kit,” https://www.xilinx.com/products/boards-and-kits/ek-z7-zc706-g.html, accessed: 2024-06-01.
  • [29] P. von Behren, “Nvml: Implementing persistent memory applications.”
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  • [31] L. X. Shi, A. Sharma, T. Z. Zhao, and C. Finn, “Waypoint-based imitation learning for robotic manipulation,” arXiv preprint arXiv:2307.14326, 2023.
  • [32] R. M. Murray, Z. Li, and S. S. Sastry, A mathematical introduction to robotic manipulation.   CRC press, 2017.
  • [33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [34] S. Franklin, “Autonomous agents as embodied ai,” Cybernetics & Systems, vol. 28, no. 6, pp. 499–520, 1997.
  • [35] R. Chrisley, “Embodied artificial intelligence,” Artificial intelligence, vol. 149, no. 1, pp. 131–150, 2003.
  • [36] P. Liu, Y. Orru, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “Ok-robot: What really matters in integrating open-knowledge models for robotics,” arXiv preprint arXiv:2401.12202, 2024.
  • [37] S. H. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chatgpt for robotics: Design principles and model abilities,” IEEE Access, 2024.
  • [38] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” in Conference on Robot Learning.   PMLR, 2023, pp. 540–562.
  • [39] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, “Navion: A 2-mw fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones,” IEEE Journal of Solid-State Circuits, vol. 54, no. 4, pp. 1106–1119, 2019.
  • [40] Y. Gan, B. Yu, B. Tian, L. Xu, W. Hu, S. Liu, Q. Liu, Y. Zhang, J. Tang, and Y. Zhu, “Eudoxus: Characterizing and accelerating localization in autonomous machines industry track paper,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA).   IEEE, 2021, pp. 827–840.
  • [41] W. Liu, B. Yu, Y. Gan, Q. Liu, J. Tang, S. Liu, and Y. Zhu, “Archytas: A framework for synthesizing and dynamically optimizing accelerators for robotic localization,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 479–493.
  • [42] R. Liu, J. Yang, Y. Chen, and W. Zhao, “eslam: An energy-efficient accelerator for real-time orb-slam on fpga platform,” in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
  • [43] Y.-S. Hsiao, S. K. S. Hari, B. Sundaralingam, J. Yik, T. Tambe, C. Sakr, S. W. Keckler, and V. J. Reddi, “Vapr: Variable-precision tensors to accelerate robot motion planning,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 6304–6309.
  • [44] R. Eyvazpour, M. Shoaran, and G. Karimian, “Hardware implementation of slam algorithms: a survey on implementation approaches and platforms,” Artificial Intelligence Review, vol. 56, no. 7, pp. 6187–6239, 2023.
  • [45] K. Sugiura and H. Matsutani, “A universal lidar slam accelerator system on low-cost fpga,” IEEE Access, vol. 10, pp. 26931–26947, 2022.
  • [46] Y. Liu, J. Li, K. Huang, X. Li, X. Qi, L. Chang, Y. Long, and J. Zhou, “Mobilesp: An fpga-based real-time keypoint extraction hardware accelerator for mobile vslam,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 12, pp. 4919–4929, 2022.
  • [47] K. Sugiura and H. Matsutani, “A unified accelerator design for lidar slam algorithms for low-end fpgas,” in 2021 International Conference on Field-Programmable Technology (ICFPT).   IEEE, 2021, pp. 1–9.
  • [48] Y. Hao, Y. Gan, B. Yu, Q. Liu, S.-S. Liu, and Y. Zhu, “Blitzcrank: Factor graph accelerator for motion planning,” in 2023 60th ACM/IEEE Design Automation Conference (DAC).   IEEE, 2023, pp. 1–6.
  • [49] L. Huang, Y. Gong, Y. Sui, X. Zang, and B. Yuan, “Moped: Efficient motion planning engine with flexible dimension support,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA).   IEEE, 2024, pp. 483–497.
  • [50] M. Bakhshalipour, S. B. Ehsani, M. Qadri, D. Guri, M. Likhachev, and P. B. Gibbons, “Racod: algorithm/hardware co-design for mobile robot path planning,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 597–609.
  • [51] D. Shah, N. Yang, and T. M. Aamodt, “Energy-efficient realtime motion planning,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–17.
  • [52] S. Lian, Y. Han, X. Chen, Y. Wang, and H. Xiao, “Dadu-p: A scalable accelerator for robot motion planning in a dynamic environment,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
  • [53] S. Murray, W. Floyd-Jones, Y. Qi, G. Konidaris, and D. J. Sorin, “The microarchitecture of a real-time robot motion planning accelerator,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2016, pp. 1–12.
  • [54] S. Murray, W. Floyd-Jones, G. Konidaris, and D. J. Sorin, “A programmable architecture for robot motion planning acceleration,” in 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), vol. 2160.   IEEE, 2019, pp. 185–188.
  • [55] S. M. Neuman, R. Ghosal, T. Bourgeat, B. Plancher, and V. J. Reddi, “Roboshape: Using topology patterns to scalably and flexibly deploy accelerators across robots,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13.
  • [56] Y. Yang, X. Chen, and Y. Han, “Dadu-rbd: Robot rigid body dynamics accelerator with multifunctional pipelines,” in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 297–309.
  • [57] S. Lian, Y. Han, Y. Wang, Y. Bao, H. Xiao, X. Li, and N. Sun, “Dadu: Accelerating inverse kinematics for high-dof robots,” in Proceedings of the 54th Annual Design Automation Conference 2017, 2017, pp. 1–6.
  • [58] J. Sacks, D. Mahajan, R. C. Lawson, B. Khaleghi, and H. Esmaeilzadeh, “Robox: an end-to-end solution to accelerate autonomous control in robotics,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).   IEEE, 2018, pp. 479–490.
  • [59] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren, and B. Jeppesen, “Towards hardware accelerated reinforcement learning for application-specific robotic control,” in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP).   IEEE, 2018, pp. 1–8.
  • [60] E. Aude and J. Aude, “A hardware accelerator for a robot arm multivariable self-tuning control,” IFAC Proceedings Volumes, vol. 24, no. 7, pp. 73–80, 1991.
  • [61] K. Gac, G. Karpiel, and M. Petko, “Fpga based hardware accelerator for calculations of the parallel robot inverse kinematics,” in Proceedings of 2012 IEEE 17th International Conference on Emerging Technologies & Factory Automation (ETFA 2012).   IEEE, 2012, pp. 1–4.
  • [62] V. Mayoral-Vilches, S. M. Neuman, B. Plancher, and V. J. Reddi, “Robotcore: An open architecture for hardware acceleration in ros 2,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 9692–9699.
  • [63] C. Lienen, M. Platzner, and B. Rinner, “Reconros: Flexible hardware acceleration for ros2 applications,” in 2020 International Conference on Field-Programmable Technology (ICFPT).   IEEE, 2020, pp. 268–276.
  • [64] Y. Hao, Y. Gan, B. Yu, Q. Liu, Y. Han, Z. Wan, and S. Liu, “Orianna: An accelerator generation framework for optimization-based robotic applications,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 813–829.
  • [65] C. Rodriguez Vidriales et al., “Universal robots® ur5. desarrollo de programación,” 2020.
  • [66] E. Dantec, R. Budhiraja, A. Roig, T. Lembono, G. Saurel, O. Stasse, P. Fernbach, S. Tonneau, S. Vijayakumar, S. Calinon et al., “Whole body model predictive control with a memory of motion: Experiments on a torque-controlled talos,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 8202–8208.
  • [67] S. Kleff, A. Meduri, R. Budhiraja, N. Mansard, and L. Righetti, “High-frequency nonlinear model predictive control of a manipulator,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 7330–7336.
  • [68] S. M. Neuman, B. Plancher, T. Bourgeat, T. Tambe, S. Devadas, and V. J. Reddi, “Robomorphic computing: a design methodology for domain-specific accelerators parameterized by robot morphology,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 674–686.
  • [69] M. Lee, S. Park, H. Kim, M. Yoon, J. Lee, J. W. Choi, N. S. Kim, M. Kang, and J. Choi, “Spade: Sparse pillar-based 3d object detection accelerator for autonomous driving,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA).   IEEE, 2024, pp. 454–467.
  • [70] S. Krishnan, Z. Wan, K. Bhardwaj, P. Whatmough, A. Faust, S. Neuman, G.-Y. Wei, D. Brooks, and V. J. Reddi, “Automatic domain-specific soc design for autonomous unmanned aerial vehicles,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2022, pp. 300–317.
  • [71] B. Yu, W. Hu, L. Xu, J. Tang, S. Liu, and Y. Zhu, “Building the computing system for autonomous micromobility vehicles: Design constraints and architectural optimizations,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2020, pp. 1067–1081.
  • [72] F. Liu, F. Yan, L. Zheng, C. Feng, Y. Huang, and L. Ma, “Robouniview: Visual-language model with unified view representation for robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.18977