Having established that the WRF enhances a user’s reachable workspace volume while remaining within ergonomic biomechanical load limits, we now consider the interaction effects between the user and the robot. During collaborative activities, disturbances are introduced in the robot’s motion plan due to the user’s independent arm movements. In order for the WRF to be an effective augmentation, it needs to be able to counteract these disturbances. In this section, we describe strategies for stabilizing the WRF’s end-effector while it is worn by a user performing close-range tasks.
Two approaches were considered for predicting human motion over the required control horizon in extended 3D tasks: an autoregressive (AR) time series model as in [4], and a recurrent neural network (RNN) model adapted from [31]. These models take in the poses of the WRF and the human, and generate a sequence of joint angle references over the time horizon. Both approaches were trained offline on the KIT Whole-Body Human Motion Database [10] and adapted for online predictive control through the framework shown in Figure 13b.
5.1. System Identification
As a precursor to the application of human motion prediction for stabilizing the WRF, the dynamic response of its motors was studied in typical usage scenarios, and system identification was performed to recover the in situ motor parameters for Model III. This allowed the sensing and actuation delays in the physical system to be estimated by augmenting the linear models with a delay term and fitting to data from the motion capture system.
Each of the Dynamixel motors used in the robot has a built-in PID controller, apart from the AX-12A motors for wrist rotation and pitching, which only have proportional control. Each motor receives a reference angle $\theta_{ref}$ as input from the PC, driving a DC motor plant whose output angle $\theta$ is measured using built-in encoders (Figure 14a).
The plant transfer function between voltage $V$ and output angle $\theta$ is based on an L-R circuit DC motor model [32], resulting in a third-order system in terms of parameters $a_0$, $a_1$, and $b_0$:

$$\frac{\theta(s)}{V(s)} = \frac{b_0}{s\,(s^2 + a_1 s + a_0)} \quad (9)$$
During system identification, the PID controller's transfer function used manufacturer-supplied values for the gains $K_P$, $K_I$, and $K_D$. This resulted in the closed-loop transfer function between the motor output angle $\theta$ and reference signal $\theta_{ref}$ being a third-order system with no zeros:

$$\frac{\theta(s)}{\theta_{ref}(s)} = \frac{\beta_0}{s^3 + \alpha_2 s^2 + \alpha_1 s + \alpha_0} \quad (10)$$
The closed-loop model parameters $\alpha_i$ and $\beta_0$ were fit to the measured output signals using the Simplified Refined Instrumental Variable method for Continuous-time model identification (SRIVC) [33]. No explicit delays were assumed in this transfer function, since the encoders are built into the motors. The plant parameters $a_0$, $a_1$, and $b_0$ were then obtained from $\alpha_i$ and $\beta_0$. Each DoF was identified individually, keeping all other motors fixed, and the magnitudes of the step reference input signals were determined from the usage scenarios (e.g., steps of 0.7 rad over 2 s for DoF-1, as shown in Figure 14b).
The accuracy of the identified system models was evaluated by computing the Normalized Root Mean Squared Error (NRMSE) goodness of fit between the output signals measured by the encoders and the simulated motor model outputs for the same reference input. The plant parameters and model fitting metrics for each DoF are listed in Table 3.
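To make the fitting step concrete, the sketch below fits the closed-loop model of Equation (10) to a step response in Python. This is a simplified stand-in, not the authors' pipeline: an ordinary output-error least-squares fit replaces SRIVC, and the "measured" signal is synthesized from placeholder parameter values.

```python
# Illustrative fit of the closed-loop model (Eq. 10) to a step response.
import numpy as np
from scipy.optimize import least_squares
from scipy.signal import lsim, lti

t = np.linspace(0.0, 2.0, 240)          # 2 s step window sampled at 120 Hz
u = 0.7 * np.ones_like(t)               # 0.7 rad reference step (DoF-1 scenario)

# Synthetic stand-in for encoder data, from a hypothetical "true" system.
theta_meas = lsim(lti([250.0], [1.0, 12.0, 150.0, 250.0]), u, t)[1]
theta_meas = theta_meas + np.random.default_rng(0).normal(0.0, 0.005, t.shape)

def residual(params):
    a2, a1, a0, b0 = params
    _, theta_sim, _ = lsim(lti([b0], [1.0, a2, a1, a0]), u, t)
    return theta_sim - theta_meas       # third-order model, no zeros (Eq. 10)

fit = least_squares(residual, x0=[10.0, 100.0, 200.0, 200.0])

# NRMSE goodness of fit, the metric reported in Table 3.
theta_fit = lsim(lti([fit.x[3]], [1.0, *fit.x[:3]]), u, t)[1]
nrmse = 1 - np.linalg.norm(theta_fit - theta_meas) / np.linalg.norm(theta_meas - theta_meas.mean())
print(f"fitted parameters: {fit.x}, NRMSE fit: {nrmse:.3f}")
```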
Having obtained the open-loop plant transfer function parameters for each of the DoFs in the WRF, we can augment these models to estimate the sensing and actuation delays in the overall system.
5.1.1. Delay Estimation
The first step in developing predictive models was to estimate the time horizon over which the WRF's motors need to be controlled to compensate for sensing and actuation delays. This time horizon $h$ (Figure 15a) was determined by system identification, using the linear model described in Equation (10) with a delay term $\tau_d$ included:

$$\frac{\theta_m(s)}{\theta_{ref}(s)} = \frac{\beta_0\, e^{-\tau_d s}}{s^3 + \alpha_2 s^2 + \alpha_1 s + \alpha_0} \quad (11)$$
Here, $\theta_m$ is the motor response to an input step signal $\theta_{ref}$, reconstructed through the inverse kinematics equations in Section 3.5 using data from the motion capture system (Figure 15b). The other terms in the transfer function, $\alpha_i$ and $\beta_0$, were obtained from the system identification performed earlier, using the parameters in Table 3 and the stock PID control gains $K_P$, $K_I$, and $K_D$. This allowed the delays in the motion capture and communication channels to be isolated from the in situ motor dynamics.
The delay $\tau_d$ was estimated to be 86 ms using the same SRIVC method as before, averaged across DoFs 1–3, which showed relatively slower responses due to larger loads. This corresponds to a prediction time horizon $h$ of about 10 time steps for the OptiTrack motion capture system used in this work, which has a frame rate of 120 Hz [34].
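A sketch of this delay fit is shown below, assuming the closed-loop parameters are already identified. Instead of SRIVC with an explicit delay term, this illustration simply scans candidate input delays and keeps the best least-squares match; the parameter values are placeholders.

```python
# Illustrative delay estimation for the model of Eq. (11).
import numpy as np
from scipy.signal import lsim, lti

FS = 120.0                                      # OptiTrack frame rate (Hz)
sys = lti([250.0], [1.0, 12.0, 150.0, 250.0])   # identified model (placeholder values)

def estimate_delay(t, u, theta_mocap, max_delay_s=0.2):
    """Return the input delay (s) that minimizes the output error."""
    best_tau, best_err = 0.0, np.inf
    for n in range(int(max_delay_s * FS)):      # shift the reference by n frames
        u_delayed = np.concatenate([np.zeros(n), u[:len(u) - n]])
        err = np.sum((lsim(sys, u_delayed, t)[1] - theta_mocap) ** 2)
        if err < best_err:
            best_tau, best_err = n / FS, err
    return best_tau                             # ~0.086 s averaged over DoFs 1-3
```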
5.2. Previous Work on Planar Stabilization
In previous work [4], we developed an end-effector stabilization strategy for a reduced 2D scenario. The positions of the WRF's base and end-effector were tracked using fiducial markers and a stereo camera (Figure 16a), while the user's arm moved periodically in the XY plane, with small displacements of ~15 cm from an initial position at frequencies typically below 1 Hz.
Using the linear system models identified through the procedure described in Section 5.1, the step response characteristics were estimated for the DoF-1 and DoF-3 motors (Table 4). In particular, the bandwidth of both motors was found to be above 1 Hz, which should have been sufficient to stabilize the WRF against small, planar human arm motions through a direct feedback control strategy (Figure 13a). However, this performance was degraded by delays in sensing and actuation.
After estimating these delays using similar linear models (Section 5.1.1), an autoregressive (AR) time series model of human arm motion was developed to determine the joint angle reference signals for DoF-1 and DoF-3, following the approach shown in Figure 13b, to stabilize the end-effector in 2D. Compared to a direct feedback control approach, the AR model reduced position errors by 19.4% in $X$ and 20.1% in $Y$ (Figure 16b).
Related work in this domain includes stabilization of SR limbs using a time series model of the forces and torques induced by the wearer's change in posture [35], as well as modeling of hand tremors as Fourier series for tool-tip compensation in a handheld surgical device [36]. This literature informed the choice of AR models for predictive control of the WRF, both in [4] and in the full 3D case considered here.
5.3. Human Motion Prediction
The estimated system time delays for the WRF served as prediction horizons for the human motion prediction models used in end-effector stabilization. The criteria for these models were real-time (or near real-time) prediction from optical motion capture data, and good performance over the required controller time horizon in close-range tasks.
Two methods were used for this purpose: an autoregressive (AR) time series model, and a single-layer gated recurrent unit (GRU) network adapted from [31] and modified for real-time performance. Both models were trained offline using the KIT Whole-Body Human Motion Database [10], available at [37]. The database consists of a wide selection of task and motion scenarios, with annotated recordings from optical motion capture systems, raw video, and auxiliary sensors (e.g., force plates). For this work, we used labeled human skeleton marker data (Figure 17) from nine tasks in the database that involve periodic movement of the subject's right arm. They are listed in Table 5, along with the number of trials performed for each task and the total number of data points with human right arm movements extracted from all trials.
The full-body skeleton marker set consists of 56 points, out of which 10 are relevant for prediction of human right arm motion. The positions of the body points are determined by weighted sums of the individual 3D marker positions (Figure 17b): 3 markers for the clavicle ($C$), 3 for the shoulder ($S$), 3 for the elbow ($E$), and 4 for the wrist ($W$).
Three relative position vectors were generated from the four body points: $\vec{r}_{CS}$ (clavicle to shoulder), $\vec{r}_{SE}$ (shoulder to elbow), and $\vec{r}_{EW}$ (elbow to wrist). This allowed the movement of each body segment to be predicted independently of its preceding neighbor, and improved the training accuracy of the models.
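A minimal sketch of this feature construction is given below; the uniform marker weighting is an assumption, as the exact weights are not specified here.

```python
# Sketch of the body point and relative vector computation.
import numpy as np

def body_point(markers):
    """Weighted sum of an (m, 3) array of marker positions; uniform weights assumed."""
    return np.average(markers, axis=0)

def body_vectors(clavicle, shoulder, elbow, wrist):
    """Stack the three relative vectors r_CS, r_SE, r_EW into a 9-D feature."""
    C, S = body_point(clavicle), body_point(shoulder)   # 3 markers each
    E, W = body_point(elbow), body_point(wrist)         # 3 and 4 markers
    return np.concatenate([S - C, E - S, W - E])        # shape (9,)
```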
5.3.1. Autoregressive Time Series Model
As in [4], the time series model started with the initial assumption of an Autoregressive Moving-Average (ARMA) process:

$$X_t = C + \sum_{k=1}^{p} \varphi_k X_{t-k} + \varepsilon_t + \sum_{k=1}^{q} \psi_k \varepsilon_{t-k} \quad (12)$$

Here, $X_t$ is a discrete univariate series, composed of a constant term $C$, past terms $X_{t-k}$ weighted by coefficients $\varphi_k$ for lag $k$ (AR term), and past white noise terms $\varepsilon_{t-k}$ weighted by the coefficients $\psi_k$ (MA term). The numbers of past terms, $p$ and $q$, determine the orders of the AR and MA parts, respectively.
Each component of the relevant body vectors $\vec{r}_{CS}$, $\vec{r}_{SE}$, and $\vec{r}_{EW}$ was considered to be an independent univariate series. The stationarity of these series was verified with augmented Dickey–Fuller hypothesis tests [38].
The autocorrelation (ACF) and partial autocorrelation (PACF) functions were computed at lags $k$ for these series. The PACF showed sharp drop-offs over successive lags compared to the ACF for each component of the body vectors, as illustrated in Figure 18 for the $X$ component of $\vec{r}_{EW}$. This indicated that the ARMA processes could be simplified into purely autoregressive (AR) models [39]:

$$X_t = C + \sum_{k=1}^{p} \varphi_k X_{t-k} + \varepsilon_t \quad (13)$$
The model order $p$ for each of the nine components of the body vectors was determined using the Akaike Information Criterion (AIC), a maximum-likelihood measure of the goodness of fit [40]. The AIC was computed for model orders up to 30 for each of the nine series, and the order with the minimum AIC was selected as $p$ for that series. The minimum AIC values were obtained at different model orders for each series, ranging from $p$ = 18 to $p$ = 25. The model parameters $\varphi_k$, $C$, and the noise variance were determined using the Yule–Walker method [41], trained on the task motions listed in Table 5.
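The full AR pipeline for one series (stationarity check, AIC-based order selection, and Yule–Walker fitting) can be sketched as below using statsmodels; the input file and tooling are illustrative assumptions.

```python
# Sketch of the AR modeling pipeline for a single body vector component.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import ar_select_order
from statsmodels.regression.linear_model import yule_walker

x = np.loadtxt("r_EW_x.csv")  # hypothetical file: X component of r_EW (cm), 120 Hz

# 1. Verify stationarity with an augmented Dickey-Fuller test.
if adfuller(x)[1] > 0.05:
    print("Warning: cannot reject unit root; series may be non-stationary")

# 2. Select the model order p (up to 30) by minimizing the AIC.
p = max(ar_select_order(x, maxlag=30, ic="aic").ar_lags)

# 3. Fit the AR(p) coefficients with the Yule-Walker equations.
phi, sigma = yule_walker(x, order=p, method="mle")  # demeans internally
mu = x.mean()

# 4. Forecast h = 10 steps (~83 ms at 120 Hz), feeding predictions back in.
window = list(x[-p:] - mu)                 # last p demeaned samples, oldest first
preds = []
for _ in range(10):
    x_next = float(np.dot(phi, window[::-1]))  # phi[0] weights the most recent lag
    preds.append(x_next + mu)
    window = window[1:] + [x_next]
print(preds)
```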
5.3.2. Recurrent Neural Network Model
While an AR model is able to forecast human motions through local predictions, it does not capture dependencies over longer time periods, or encode structural information about the correlations between body components over time. To account for these factors and improve on the predictions of the AR models, we used a recurrent neural network (RNN) model for human arm motion prediction, and compared the performance of the two methods.
Independent of robotics, RNNs have been applied extensively to human motion prediction, including architectures with Long Short-Term Memory (LSTM) cells [42], and structural RNNs that encapsulate semantic knowledge through spatio-temporal graphs [43]. These approaches use multiple recurrent layers, as they are aimed at offline prediction of the entire human skeleton and task classification in general motion scenarios. As the task scenarios for WRF stabilization involve periodic motions and require prediction of only the wearer's arm, we used a simpler model with a sequence-to-sequence architecture [44] and a single Gated Recurrent Unit (GRU), as proposed in [31], which also includes a residual connection for modeling velocities. Compared to the AR model, this resulted in higher prediction accuracy of human arm motion, and improved end-effector stabilization in most task scenarios.
The schematic of the RNN model is shown in Figure 19a. It consists of an encoder network that takes in a 9-dimensional input of the body vectors, $(\vec{r}_{CS}, \vec{r}_{SE}, \vec{r}_{EW})$, 50 frames at a time from the KIT database or the motion capture system, and a decoder network that converts the output of a single GRU cell with 1024 units into 9-dimensional predictions over $k$ steps. Based on the estimated system delay, we set $k$ = 10, with a learning rate of 0.05 and a batch size of 16, as specified in [31] for predictions up to 400 ms. The RNN model was trained on the KIT Database motions listed in Table 5, and converged after about 5000 iterations of Mean-Squared Error (MSE) loss minimization, as shown in Figure 19b.
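A minimal PyTorch sketch of this predictor is shown below. The single GRU cell with 1024 units, 50-frame input window, k = 10, learning rate 0.05, and batch size 16 follow the text; the exact encoder/decoder layering of [31] is simplified here to a shared GRU cell with a linear decoder, which is an assumption.

```python
# Sketch of the single-GRU sequence-to-sequence predictor with residual output.
import torch
import torch.nn as nn

class ArmMotionGRU(nn.Module):
    def __init__(self, n_features=9, hidden=1024):
        super().__init__()
        self.cell = nn.GRUCell(n_features, hidden)    # single GRU cell, 1024 units
        self.decoder = nn.Linear(hidden, n_features)  # decode hidden state to 9-D output

    def forward(self, seed, k=10):
        # seed: (batch, 50, 9) window of body vectors; returns (batch, k, 9)
        h = torch.zeros(seed.size(0), 1024, device=seed.device)
        for t in range(seed.size(1)):                 # encode the 50-frame window
            h = self.cell(seed[:, t], h)
        x, preds = seed[:, -1], []
        for _ in range(k):                            # decode k future frames
            h = self.cell(x, h)
            x = x + self.decoder(h)                   # residual connection: predict deltas
            preds.append(x)
        return torch.stack(preds, dim=1)

model = ArmMotionGRU()
opt = torch.optim.SGD(model.parameters(), lr=0.05)   # learning rate from the text
batch = torch.randn(16, 50, 9)                       # placeholder for KIT data windows
target = torch.randn(16, 10, 9)
loss = nn.functional.mse_loss(model(batch, k=10), target)
opt.zero_grad(); loss.backward(); opt.step()
```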
5.3.3. Model Evaluation
Both models were evaluated on the relevant motions from the KIT Database listed in Table 5. They were trained offline using all but two trials of each task, with one of the remaining trials serving as the validation set and the other as the test set. The training set was expanded to four times its original size by adding Gaussian white noise with a standard deviation of 1 cm to each of the nine components of the body vectors, resulting in 89,864 training data points. The test and validation sets had 18,922 and 15,042 data points, respectively.
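The augmentation step can be sketched as follows, assuming the training data is an (N, 9) array of body vector components in cm.

```python
# Sketch of the noise-based training set augmentation described above.
import numpy as np

rng = np.random.default_rng(0)

def augment(data, copies=3, sigma=1.0):
    """Return data plus `copies` noisy replicas (sigma in cm), 4x the original size."""
    noisy = [data + rng.normal(0.0, sigma, data.shape) for _ in range(copies)]
    return np.concatenate([data, *noisy], axis=0)
```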
The Root-Mean-Square (RMS) prediction errors were computed on the test set for both models, and are listed in Table 6. While the RNN model did not improve upon the AR model for every component, it reduced the prediction errors in the components where the AR model performed worst (Figure 20). The RNN model also performed better overall, with an average RMS error of ~0.90 cm, compared to ~1.25 cm for the AR model.
Figure 21 shows that while the RNN model tended to overshoot the ground truth and remain offset from it, it tracked the overall motion trends better than the AR model.
5.4. Implementation on the WRF
Having obtained two predictive models for human arm motion that performed well on the KIT Database, we applied them to stabilize the WRF's end-effector at an initial pose under disturbances caused by movements of the user's right arm. For validation of these models, we considered five task scenarios, shown in Figure 22, that involved periodic arm movements of relatively small magnitude: (a) tracing a line of length 10 cm, (b) tracing a circle of diameter 10 cm, (c) wiping a desk top, (d) painting with small brush strokes on a canvas, and (e) placing ten objects into the shelves of a table-top drawer unit. Each task was performed for ~5 min, with each iteration lasting between 5 s (tracing lines) and 30 s (placing objects), depending on the complexity of the task. The initial end-effector pose was selected to be to the right of the user and below their arm, so as not to impede the task.
Optical markers were placed on the user's right hand and elbow, as well as on the WRF's end-effector and near the DoF-1 motor (Figure 23).
These markers were tracked at 120 Hz using an OptiTrack motion capture system [34]. The raw marker position data were smoothed using an IIR low-pass digital filter with transfer function coefficients set for a normalized cutoff frequency of 6 Hz [45], following the techniques discussed in [46,47].
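A plausible implementation of this smoothing step is sketched below; the text specifies only an IIR low-pass filter with a 6 Hz cutoff, so the fourth-order Butterworth design is an assumption.

```python
# Sketch of the marker smoothing step.
from scipy.signal import butter, filtfilt

FS = 120.0      # OptiTrack frame rate (Hz)
CUTOFF = 6.0    # cutoff frequency (Hz)

b, a = butter(4, CUTOFF / (FS / 2))   # cutoff normalized by the 60 Hz Nyquist frequency

def smooth(marker_xyz):
    """Zero-phase filter each column of an (N, 3) marker trajectory.
    filtfilt is suitable for offline smoothing; a causal lfilter would be used online."""
    return filtfilt(b, a, marker_xyz, axis=0)
```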
In all the scenarios shown in Figure 22, the body vector $\vec{r}_{CS}$ was assumed to be constant in each task, as the human shoulder and torso remained almost stationary at their initial positions. The other relevant points to be tracked, $B$ (base position of the WRF) and $R$ (position of the end-effector), are shown in Figure 23. We aimed to keep the end-effector static at its initial point $R_0$ from the start of each task. If the user's arm moves, the end-effector also moves, by an amount $\Delta r_t$ at time $t$. To generate appropriate setpoints for the WRF's motors, $\Delta r_t$ is converted from the global frame $G$ (fixed lab frame) to the robot's base frame $B$. Using the convention $^{A}T_{B}$ for the homogeneous transformation of the pose of frame $B$ as seen in frame $A$, we need to convert $^{G}T_{R}$ to $^{B}T_{R}$. Using the elbow frame $E$ as an intermediate,

$$^{B}T_{R} = \,^{B}T_{E}\;^{E}T_{G}\;^{G}T_{R} \quad (14)$$
The transformation between the robot base $B$ and elbow $E$ is constant, while the transformation $^{E}T_{G}$ consists of two variable parts: the rotation matrix $R_{GE}$ between the elbow and ground frames, and the position of the elbow, $p_E$, which is tracked directly by the motion capture system. $R_{GE}$ is the rotation matrix that takes the unit vector along the local X-axis, $\hat{x}$, and aligns it with the unit vector along the human forearm, $\hat{f}$, in the ground frame. Using the approximate method for position-only inverse kinematics (Jacobian pseudoinverse) discussed in Section 3, the change in the WRF joint variables can be determined:

$$\Delta q_t = J^{+}\,^{B}\Delta r_t \quad (15)$$
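The frame conversion and IK step just described can be sketched as follows; the Rodrigues construction of $R_{GE}$ and the function names are illustrative, and the position Jacobian from Section 3 is taken as given.

```python
# Sketch of the frame conversion and position-only IK correction (Eqs. 14-15).
import numpy as np

def rot_align_x(forearm):
    """Rotation taking the unit x-axis onto the unit forearm vector (Rodrigues)."""
    f = forearm / np.linalg.norm(forearm)
    x = np.array([1.0, 0.0, 0.0])
    v, c = np.cross(x, f), float(np.dot(x, f))
    K = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    return np.eye(3) + K + (K @ K) / (1.0 + c)  # assumes forearm not anti-parallel to x

def joint_correction(q, delta_r_G, forearm_G, p_elbow_G, T_BE, jacobian):
    """Map an end-effector displacement in frame G to joint corrections (Eq. 15)."""
    T_GE = np.eye(4)
    T_GE[:3, :3] = rot_align_x(forearm_G)   # variable rotation part of E in G
    T_GE[:3, 3] = p_elbow_G                 # elbow position tracked by motion capture
    T_BG = T_BE @ np.linalg.inv(T_GE)       # B <- E <- G, with T_BE constant
    delta_r_B = T_BG[:3, :3] @ delta_r_G    # displacements transform by rotation only
    return np.linalg.pinv(jacobian(q)) @ delta_r_B   # position-only IK step
```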
At time $t$, this gives the desired setpoint reference for each motor used for direct feedback control:

$$q_{ref}(t) = q(t) - \Delta q_t \quad (16)$$

Following the procedure shown in Figure 13b, the predictive models were used to generate setpoint references over a time horizon of ~86 ms for each motor in the WRF:

$$q_{ref}(t + i\,\delta t) = q_{ref}(t) + \Delta \hat{q}_{t+i\,\delta t}, \quad i = 1, \ldots, k \quad (17)$$
For each motion capture frame received at time $t$, a sequence of $k$ = 10 joint angle references was sent to each motor, with $\delta t \approx$ 8.3 ms and $i = 1, \ldots, k$. As described above, $q_{ref}(t)$ is the desired joint angle in direct feedback control, computed using inverse kinematics from the detected human and robot poses at time $t$. The predictions from the AR and RNN models are represented as residuals $\Delta \hat{q}_{t+i\,\delta t}$ added to $q_{ref}(t)$.
During implementation, it was found that the AR model could generate predictions nearly in real time, though it required a few seconds of sensor data at the start of each task to initialize the predictors. In comparison, the RNN model had lags of up to ~50 ms due to computational bottlenecks when predicting over the specified time horizon. To account for these lags, the pre-trained RNN model was executed in parallel with the AR model. Until a prediction was received from the RNN model, the AR prediction was used for computing $q_{ref}$. Depending on the amount of lag, determined through time stamps, a corresponding number of RNN predictions was discarded (typically the first 5–6 terms), and the remaining ones were added to the sequence sent to the motors.
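The arbitration between the two predictors can be illustrated with a sketch like the one below; the structure and names are assumptions rather than the authors' implementation.

```python
# Sketch of the AR/RNN arbitration logic described above.
import time

def fuse_predictions(q_ref, ar_preds, rnn_result, dt=1 / 120):
    """Prefer RNN residuals when available, discarding stale leading terms."""
    if rnn_result is None:                       # RNN still computing: fall back to AR
        return [q_ref + r for r in ar_preds]
    rnn_preds, t_issued = rnn_result
    stale = int((time.time() - t_issued) / dt)   # e.g., ~50 ms lag -> 5-6 terms
    usable = rnn_preds[stale:]
    merged = usable + ar_preds[len(usable):]     # pad the tail with AR residuals
    return [q_ref + r for r in merged]
```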
This implementation of human motion prediction (RNN + AR) reduced the mean error in end-effector position by up to ~26% over direct feedback control, while the AR model alone improved upon direct feedback control by up to ~19%, as listed in Table 7.
Figure 24 shows that the performance of all three control methods varied with the task: more structured, periodic motions, such as tracing a line or a circle, showed better stabilization performance than less structured motions, such as stowing items into a drawer.