1. Introduction
Remote sensing research plays a crucial role in various human analysis fields, such as detection [1], tracking [2], path planning [3,4], and group re-identification [5] using UAV aerial imagery data. The highD dataset demonstrates that UAVs are essential for capturing the trajectories of traffic participants, including humans [6]. UAVs offer the advantage of recording naturalistic behavior in large-scale environments through camera-equipped drones. Pedestrian movement analysis and UAV-based remote sensing research are closely intertwined, as both fields leverage spatiotemporal data to enhance the understanding of human movement patterns and environmental interactions from an elevated perspective. The former enriches the latter by providing ground-level insights that improve the calibration and interpretation of aerial data, ultimately enhancing the precision of spatial analyses [7,8,9]. The data collected from UAVs are critical for predicting human motion, which is of great importance for self-driving vehicles in addressing intelligent obstacle avoidance challenges posed by surrounding moving objects, such as autonomous robots [10,11], pedestrians [12,13,14], and vehicles [15,16].
Human movement is inherently stochastic, influenced by a variety of factors, including individual behaviors and environmental conditions, e.g., a latent decision toward a long-term goal or a random decision in response to environmental changes. Thus, the future trajectory of a pedestrian is influenced by behavioral and stochastic factors; the former include social interactions (e.g., avoiding collisions, following others, or walking together), and the latter include scene semantics (e.g., walls, barriers, or traffic), dynamic obstacles (e.g., moving vehicles, cyclists, or other pedestrians), etc. This complex phenomenon introduces a multimodal challenge marked by significant uncertainty. As a result, a pedestrian can follow a range of plausible future trajectories, which calls for multimodal predictions.
However, humans are target-oriented agents who actively express their intentions through actions to achieve desired targets. Therefore, analyzing this uncertainty requires a comprehensive understanding of pedestrian movement patterns in the context of environmental factors. The multimodal uncertainty may stem from the influence of cognitive consciousness on pedestrian motion, from the avoidance of interactions with others, or from sharp turns by other individuals that lead to changes in the path. Existing methods fail to adequately address and model this multimodal uncertainty within a solid theoretical framework.
Motivated by the long-term target-oriented principles underlying human motion, we integrate historical trajectories and scene semantics into our approach. The inherent uncertainty in trajectory prediction is addressed through a binary classification framework, which combines potential long-term goals inferred from the historical behaviors of pedestrians, as well as environmental factors and path selection variables influenced by their habits. Specifically, this framework considers habitual factors derived from pedestrian historical trajectories and stochastic decision factors arising from environmental conditions and individual preferences. Therefore, the multimodal uncertainty analysis of trajectory targets extends the predictive horizon for pedestrian trajectories.
On the other hand, since pedestrian trajectory prediction is time-dependent, we use the energy diffusion principle to explain its uncertainty. The future distribution of particles is considered in a thermodynamic framework. Under high-uncertainty conditions, particles (positions) are randomly distributed across all walkable regions, whereas under low-uncertainty conditions, particles aggregate and deform into a clear trajectory.
Our approach aims to learn from this diffusion process by gradually decomposing the uncertainty of trajectory prediction into interpretable factors and converting the ambiguous prediction regions into diverse deterministic multimodal trajectories. We then design a global and comprehensible deep learning network that fuses the information from humans and their surrounding environment and factorizes the multimodal problem into explicit factors for analysis. For the network framework, we use U-Net [17] as the backbone sub-network, as it is well suited to semantic segmentation. Due to the symmetrical structure of the U-Net, our overall network also exhibits symmetry, which facilitates the global representation and interpretability of the trajectory encoding and decoding sub-networks.
In this paper, we propose a symmetrical U-Net architecture to learn both the stochastic and behavioral factors in pedestrian trajectory prediction through global thermal diffusion analysis. First, we perform semantic segmentation on the scene map to identify walkable areas, which serve as the stochastic factor, i.e., the environmental input of our prediction framework. Next, one U-Net branch is employed to learn semantic information, identifying plausible path nodes and target points. We argue that, for a given scene map, the potential movement targets and path nodes are static and unaffected by the agent’s historical movements. Therefore, for the path node and target encoder, we use only the processed scene map as input, which significantly improves training efficiency (as demonstrated in Section 4.6). In parallel, an additional U-Net branch is used to decompose the behavioral factor by learning from past trajectories. Specifically, the trajectory branch integrates both historical trajectory data and scene information. This is achieved by combining the processed scene map with the trajectory heatmap, which is then fed into the trajectory encoder. Furthermore, the node and target heatmaps help the trajectory decoder make a global estimation of the trajectory distribution at a future time.
Therefore, we design a novel neural network architecture, named LSN-GTDA, built from symmetrical U-Net [17] sub-networks, which not only reduces the estimation error significantly but also markedly reduces the training time needed to reach convergence. The LSN-GTDA framework has two input ports, corresponding to the scene heatmap encoder and the trajectory encoder, respectively.
In addition, to factorize the inherent uncertainty of human trajectories into explicit factors, we apply a novel thermal diffusion process, grounded in signals and systems theory, that uses the complete response mechanism within the proposed symmetrical network. Specifically, the energy diffusion of the complete response separates human movement into a historical part and a future stochastic part. If we take the start of the prediction horizon as the zero-moment, the diverse motion patterns arising from the two uncertainty factors correspond to the two components of the complete response, so they can be analyzed with the energy diffusion principle. The semantic information and historical trajectory fed into the symmetrical network therefore drive the diffusion process, which makes the operation of the symmetrical network explicable.
In total, the contributions in this paper are threefold and are listed as follows:
(1) A novel symmetrical U-Net-based framework is proposed, which integrates pedestrian historical trajectories with scene semantic segmentation information. Through a bottom–up global analysis of trajectory nodes, paths, and targets, the framework enables multimodal long-term predictions that leverage historical patterns and account for future stochastic uncertainties.
(2) A global thermal diffusion mechanism is introduced for the symmetrical network, utilizing the global concept of a complete response mechanism. This approach enhances the interpretability of the model, addresses uncertainty within the network, and provides faster inference and a deeper understanding of the underlying principles of human trajectory prediction.
(3) Extensive experiments are conducted on human trajectory prediction in diverse UAV scenarios, demonstrating competitive accuracy and efficiency. The results show that our approach outperforms state-of-the-art baselines.
3. The Proposed Method
Generally, as shown in Figure 1, the main uncertainty affecting human trajectory prediction can be factorized into the past behavioral influence and the future stochastic factors, which makes it a multi-modality problem. Therefore, we design the human trajectory prediction framework as follows.
Let denote the 2D coordinates of a pedestrian in the image scene, and the sequence of the observed past positions is denoted by during the past period of seconds, where defines the frame rate. LSN-GTDA is designed to forecast positions of the human in the subsequent temporal period, e.g., the position in the next seconds, where .
The future predictions of human motion are uncertain. Pedestrian trajectory prediction involves multiple possibilities; there are multiple patterns for both the destination and the optional path, which makes the prediction a multi-modality problem. Hence, we divide the overall uncertainty into two factors: the behavioral factor and the stochastic factor. We define the number of multiple predictions of the final trajectory target as
, which depends on the behavioral uncertainty. In addition, the number of multiple predictions of the path access to a target is defined as
, which reflects the stochastic uncertainty. In the short-term prediction case,
is usually small, and
is more dominant since the path diversity is limited by the short time horizon; e.g., the total predicted trajectories are the same as
if
. In the long-term prediction case, both of them dominate since there is more uncertainty in the long-term horizon, and moderate diversity of the predicted path (
) can reduce the uncertainty, which will be evaluated in
Section 4.5.1.
3.1. Thermal Diffusion Process
To analyze the proposed framework in Figure 2, we divide the network into two parts: the processing of the pedestrians’ past trajectories and the processing of scene maps. The former can be regarded as the system’s input due to its strong variability; i.e., the historical trajectory of each pedestrian can vary greatly. The latter, on the other hand, can be considered the inherent state of the system due to its spatial invariance: the path points and target points generated in the same scene remain unchanged regardless of the input. We consider the current prediction moment to be the zero-moment, and the system’s output is our prediction of the pedestrian trajectory.
Human trajectories exhibit collective movement patterns through a series of waypoints, which reflect the behavior of human gaits. To find precise and coherent motion patterns within the trajectory, the optical flow field [46] can be used to construct the global motion correlation of the waypoints in the trajectory. Specifically, each waypoint is regarded as a heat source node and is connected with neighboring nodes through energy diffusion in the optical flow field [47], which models the inherent relationship between the waypoints of a trajectory.
3.1.1. Behavior Consistency in Neighboring Nodes of the Trajectory
At first, according to the neighboring similarity in [
34], the similarity between a pair of neighboring path nodes
i and
j can be defined as
where
measures the correlation of velocity between pair-wise nodes at time
t.
reflects the consistent behavior of an individual among its neighbors. To further describe the correlation for the waypoints that are not adjacent, we propose a graph structure by the connectivity of its associating nodes.
Let
be the weighted adjacency matrix to describe the correlation in a graph, which is used to associate with the node set
of the trajectory. The similarity
is an edge between the pair-wise nodes in Equation (
1). Let
denote a trajectory with length
h by nodes
on
between individual nodes 0 and
h. Then, the probability of the real trajectory for path
can be defined as
Since there might be more than one path between the node
i and
j, let the set
contain all paths of length
h between them; then, the
h-path similarity is represented as
According to Theorem 1 in [
34],
can be efficiently computed by matrix
since it is the
entry of
.
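The h-path similarity described above can be illustrated with a short NumPy sketch. It assumes, purely for illustration, that the neighboring similarity is the cosine similarity of node velocities clipped at zero, and that the probability of a path is the product of its edge weights; the function names `neighbor_similarity` and `h_path_similarity` are hypothetical and not taken from the paper. Under these assumptions, summing over all paths of length h between two nodes reduces to reading one entry of the h-th power of the weighted adjacency matrix.

```python
import numpy as np

def neighbor_similarity(velocities, neighbor_pairs, eps=1e-12):
    """Build a symmetric weighted adjacency matrix W.

    Assumption: the edge weight is the cosine similarity of the two
    node velocities, clipped at zero; non-neighbors get weight 0.
    """
    v = np.asarray(velocities, dtype=float)
    unit = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    corr = unit @ unit.T
    W = np.zeros((len(v), len(v)))
    for i, j in neighbor_pairs:
        W[i, j] = W[j, i] = max(corr[i, j], 0.0)
    return W

def h_path_similarity(W, h):
    """Entry (i, j) sums the edge-weight products over all length-h paths."""
    return np.linalg.matrix_power(W, h)

# Four waypoints on a chain 0-1-2-3 with slowly turning velocities.
velocities = [[1.0, 0.0], [1.0, 0.2], [0.9, 0.5], [0.5, 1.0]]
W = neighbor_similarity(velocities, [(0, 1), (1, 2), (2, 3)])
S2 = h_path_similarity(W, 2)

# The only length-2 path from node 0 to node 2 is 0 -> 1 -> 2.
print(S2[0, 2], W[0, 1] * W[1, 2])
```

On a chain, the two printed values coincide because a single path connects nodes 0 and 2 in two steps; on denser graphs the matrix power accumulates every admissible path without enumerating them.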
3.1.2. The Thermal Diffusion Process by an Analysis of the Complete Response Mechanism
Building on the above, this section describes the thermal diffusion process among the nodes of the path. Since the h-path similarity in Equation (3) measures the behavioral consistency of pair-wise nodes, each node in the path can be treated as a “heat source” that diffuses energy to other nodes; thus, we follow a classical idea from [48] and model the thermal diffusion as follows:
where
is the thermal energy for the particle at node
after performing the propagation of thermal energy for
t seconds, and
is the propagation coefficient.
is the input motion vector for node
. Without
, Equation (
4) can be solved by [
37] as follows:
where
is the final diffused thermal energy for the waypoint after
t seconds,
is the set of all waypoints in the trajectory, and
H and
W are the height and width of the input image. Motivated by the complete response theory from the discipline of Signals and Systems [
49], the first term in Equation (
4) models the diffusion of thermal energies over free space so that the spatial correlation among path nodes can be properly enhanced during energy propagation, which can be viewed as the zero-state response if the initial state of the thermal diffusion process is supposed to be none at the zero-moment. As [
37] mentions, after
T seconds, the spreading energy from
to
is
where
and
are the propagation coefficient and the force propagation factor, respectively.
denotes the axis.
is the optical flow that describes the motion tendency at position
,
is the current motion pattern of
, which is initialized by
and its thermal energy in location
j is
[
37]. However, the energy diffusion strategy of [
37] ignores the initial motion state
; due to the lack of consideration of the initial state before the pair-wise particles’ energy diffusion, i.e., the zero input state, its mechanism is too limited to handle the global prediction issue. Therefore, we introduce a global formulation that adds the zero-input response to the thermal diffusion process. Let
be the total motion energy at location
; it consists of two parts. One is the zero-state response
, and the other is the zero-input response
. We consider the pedestrian’s current position as the zero-moment in the global system analysis to predict the subsequent waypoints they may pass;
represents the uncertainty of the stochastic factor in the future, whereas
considers the epistemic factor before the zero-moment. Mathematically,
Equation (
7) is obtained by simplifying Equation (
6). Since
i and
j are adjacent nodes, we assume the length of
in Equation (
6),
and
are two coefficients. In a real road environment, the background scene determines a priori which regions can be traversed and which cannot; this corresponds to the zero-input response, which carries historical information such as previous trajectories. On the other hand, the zero-state response focuses on the stochastic factor, including the pedestrian’s random motion choices on the background map after the zero-moment. According to Equations (
2) and (
3), the energy associated with each node signifies the likelihood of pedestrian passage, thereby influencing the heatmap distribution of the subsequent adjacent node through thermal diffusion. Greater energy levels indicate a higher likelihood of passage. The optimal path can be inferred through the integration of the global thermal diffusion process outlined above.
We factorize the whole uncertainty into two explicit parts, which can be analyzed by the two corresponding components of the complete response mechanism. In practice, categorizing the uncertainties within the model helps identify which uncertainties can potentially be reduced, thereby enhancing the interpretability of the prediction study. Hence, to validate the interpretability of the proposed method, we analyze the effectiveness of the two proposed uncertainty factors: the stochastic factor and the behavioral factor. We propose the “Target Modality Sampling Strategy” (TMSS) to provide the specific number of target samples needed for evaluation. TMSS investigates the impact of varying the number of sampled modalities on the outcomes for a specified number of targets, which reflects the behavioral factor associated with the zero-input response in the proposed thermal diffusion process. TMSS uses the K-means method to directly control the sampled modalities and the number of cluster centers. TMSS is therefore aware of the number of target samples needed in the evaluation procedure.
In addition, since the target and the path depend on each other, we use another strategy, the “Path and Node Modality Sampling Strategy” (PNMSS), to analyze the role of the stochastic factor through different path modalities for a given target, as in Equation (3); this reflects the impact of the zero-state response in the proposed thermal diffusion process.
Therefore, it is useful to build the complete response to describe the thermal diffusion model by zero-input and zero-state response, which provides theoretic interpretability for the design of the following proposed deep learning-based network.
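As a rough illustration of how K-means can control the number of sampled target modalities, the sketch below draws candidate points from a target heatmap and clusters them into a fixed number of centers. This is a minimal sketch under stated assumptions, not the authors' implementation: the sample-then-cluster pipeline, the farthest-point initialization, and the names `sample_from_heatmap` and `kmeans` are all illustrative choices, and the heatmap is synthetic.

```python
import numpy as np

def sample_from_heatmap(heatmap, n_samples, rng):
    """Draw (x, y) pixel coordinates with probability proportional to the heatmap."""
    prob = heatmap.ravel() / heatmap.sum()
    idx = rng.choice(heatmap.size, size=n_samples, p=prob)
    ys, xs = np.unravel_index(idx, heatmap.shape)
    return np.stack([xs, ys], axis=1).astype(float)

def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with farthest-point initialization (an assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = points[labels == c].mean(axis=0)
    return centers

# Synthetic target heatmap with two plausible destinations.
ys, xs = np.mgrid[0:32, 0:32]
heatmap = (np.exp(-((xs - 8) ** 2 + (ys - 8) ** 2) / (2 * 1.5 ** 2))
           + np.exp(-((xs - 24) ** 2 + (ys - 20) ** 2) / (2 * 1.5 ** 2)))

rng = np.random.default_rng(0)
samples = sample_from_heatmap(heatmap, 500, rng)
targets = kmeans(samples, k=2)
targets = targets[np.argsort(targets[:, 0])]  # order by x for readability
print(targets)
```

Fixing `k` fixes the number of target modalities that reach the evaluation, while the number of raw samples only affects how well each cluster center is estimated.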
Figure 3 shows visual representations of TMSS and PNMSS; the target and the path node depend on each other in a scene where the road forks into three different paths. If the sampled target is located at the bottom, the pedestrian tends to move downwards rather than upwards in the map and therefore does not pass the upper waypoint in Figure 3b.
According to the above analysis, we propose hierarchical prior information to model the process. After the targets are sampled, we fix the target to constrain the possible positions of the next path node. We assume the node lies on a straight line segment between the sampled target and the past trajectory. Then, we relax this assumption with a multivariate Gaussian prior whose mean is at the assumed position, which is multiplied pixel-wise with the distribution of the predicted path node. The thermal diffusion is represented by this distribution. Fusing the prior and the predicted distribution yields scenario-compliant path nodes in a feasible direction. Finally, we use the softargmax operation [50] for the first node and sample the remaining nodes randomly to obtain 2D points from the distribution. We repeat the above process for the next node to realize the thermal diffusion in the path generation.
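The fusion step can be sketched in a few lines of NumPy. The example assumes an isotropic Gaussian prior, a hypothetical interpolation factor `alpha` that places the prior mean on the segment between the last observed position and the sampled target, and a uniform stand-in for the predicted node heatmap; the soft-argmax is written as the expected coordinate of the already-normalized fused map. All names are illustrative.

```python
import numpy as np

def gaussian_prior(shape, mean_xy, sigma):
    """Isotropic Gaussian bump over the pixel grid (unnormalized)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - mean_xy[0]) ** 2 + (ys - mean_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fuse(pred_heatmap, prior):
    """Pixel-wise product of prediction and prior, renormalized to sum to 1."""
    fused = pred_heatmap * prior
    return fused / fused.sum()

def soft_argmax(prob):
    """Expected (x, y) coordinate under a normalized probability map."""
    ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]
    return np.array([(prob * xs).sum(), (prob * ys).sum()])

def sample_point(prob, rng):
    """Draw one (x, y) pixel from the probability map."""
    idx = rng.choice(prob.size, p=prob.ravel())
    y, x = np.unravel_index(idx, prob.shape)
    return np.array([x, y])

last_obs = np.array([4.0, 8.0])   # last observed position (x, y)
target = np.array([12.0, 8.0])    # sampled target (x, y)
alpha = 0.5                       # assumed position along the segment
mean_xy = last_obs + alpha * (target - last_obs)

pred = np.ones((17, 17))          # stand-in for a predicted node heatmap
prob = fuse(pred, gaussian_prior(pred.shape, mean_xy, sigma=1.5))

first_node = soft_argmax(prob)                              # deterministic
other_node = sample_point(prob, np.random.default_rng(0))   # stochastic
print(first_node, other_node)
```

With a uniform prediction the fused map is just the prior, so the soft-argmax lands on the assumed position; a real predicted heatmap would pull the node toward walkable, high-probability pixels near that position.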
In summary, as shown in Figure 4, TMSS provides possible motion targets by using the existing historical motion state, which represents the zero-input response in the diffusion process. PNMSS provides feasible paths for a specific target by using diverse input factors after the zero-moment, regardless of the initial state, which represents the zero-state response in the diffusion process. The proposed thermal process combines the two for global analysis, so that the diversity of both the target and the path is considered.
3.2. Symmetrical LSN-GTDA Human Trajectory Forecasting Structure
As shown in Figure 2, LSN-GTDA is a symmetrical network that comprises two U-Net-based branches. The upper branch analyzes the behavioral uncertainty by learning the semantic information of input scenes to estimate the distribution of candidate targets; its output, together with that of the scene heatmap encoder, is fed to the lower branch to enhance the decoding of the trajectory. The lower branch analyzes the stochastic uncertainty by learning from the past trajectory and the walkable area. We further use the zero-input response and the zero-state response to estimate the heatmap distribution for the two branches, respectively. Finally, we predict the global future trajectory from the heatmap.
Generally, trajectory points may be randomly distributed across the entire space under high uncertainty. By using the proposed global thermal diffusion process, the diffused global energy can make the predicted trajectory more focused through its heatmap and thus reduce uncertainty.
To understand the pedestrian’s scene effectively, semantic information is exploited as an auxiliary cue for the trajectory prediction module. Therefore, we formulate the deep learning network for human trajectory prediction as two modules: a semantic segmentation module and a trajectory forecasting module.
For the semantic segmentation module, we use a segmentation network by U-Net to obtain a segmentation map
of scene image
I, which includes
classes of behaviors such as standing, walking, running, etc. On the other hand, the previous action history
of the agent
h is transformed to a trajectory heatmap
of
channels that comprise spatial sizes
I; they correspond to the past
seconds sampled at the frame rate. The trajectory heatmap is defined as
To add interactions and constraints, such as obstacle avoidance, between the agents and the given scene, we pretrain the semantic segmentation model to use the sparse scene image data efficiently. The U-Net model [17] with a ResNet101 [51] backbone is used in our implementation. The ResNet101 encoder’s weights are pretrained on ImageNet, while the weights for the segmentation head and U-Net decoder are randomly initialized. The images are downsampled four times and padded to be divisible by 32 as required for U-Net; then, the size is cropped to 256 × 256 pixels. The SDD segmentation model is trained using the Adam optimizer.
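The padding requirement can be illustrated as follows. An encoder that halves the spatial resolution five times needs input sizes divisible by 2^5 = 32, which is presumably why the images are padded this way. The sketch below is only an illustration: the helper name `pad_to_multiple`, the zero padding, and the bottom/right placement are assumptions, not details given in the text.

```python
import numpy as np

def pad_to_multiple(img, multiple=32):
    """Zero-pad the bottom and right edges so height and width divide `multiple`."""
    h, w = img.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad, mode="constant")

image = np.zeros((100, 130, 3), dtype=np.uint8)  # hypothetical downsampled frame
padded = pad_to_multiple(image)
print(padded.shape)
```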
The semantic segmentation module is crucial for understanding the environment in which pedestrians navigate. Segmenting the scene into regions based on semantic categories (e.g., sidewalks, roads, crosswalks, buildings) helps the model distinguish between walkable and non-walkable areas, enabling more realistic predictions of pedestrian movement.
Then, we concatenate the trajectory heatmap with the semantic map, which represents the probability distribution of where a pedestrian is likely to move in the future. Different from [18], the proposed trajectory feature and semantic feature are encoded independently. Therefore, the channel dimensions of the semantic encoder and the trajectory encoder are
and
, respectively.
The semantic segmentation module informs the trajectory heatmaps by providing a contextual understanding of the environment. These two components create a dynamic and context-aware framework for predicting pedestrian movement. While the semantic segmentation module ensures that the prediction is grounded in the real-world environment, the trajectory heatmaps enable the model to predict not just the location but the future path distribution, capturing the inherent uncertainty in human motion.
In this way, the semantic information can be introduced along with the channel dimension to produce semantic heatmap representation, which is passed to the encoder as its first input channel. The overall network is symmetrical in terms of the structure of the encoder and decoder parts.
3.3. Encoder of the Trajectory Module
In terms of the trajectory module, we consider two aspects of its function. First of all, the observed scene heatmap tensor
is processed with an additional U-Net encoder
rather than a common U-Net encoder that is shared by the semantic module and the trajectory module. In this way, the U-Net encoder [
17] consists of
blocks with max pooling and increased channel depth correspondingly. In this way, the final deep representation
and
intermediate feature tensors are spatially compact, and various spatial resolutions are provided to the subsequent goal decoder.
Moreover, as shown in
Figure 2, the proposed trajectory module is connected with the output of the semantic segmentation decoder
, which propagates useful semantic information to the proposed trajectory module and helps the module pay more attention to the regions containing goals that are more likely to be reached, while ignoring unimportant regions. Pedestrians are always drawn to destinations; therefore, we use goal attraction from the additional semantic information to guide the trajectory to the goal.
3.4. Decoder
3.4.1. Target and Node Heatmap Decoder
The main role of this decoder is to generate useful clues about the target and path node for human motion. Because the encoder blocks deepen the feature maps while pooling reduces their spatial resolution, many details are unavoidably lost during the convolution and pooling operations of the encoder. Thus, we merge the symmetric features from the LSN-GTDA encoder, which noticeably recovers the resolution. These features are combined as an input to the trajectory decoder, and we use the expansion arm of the scene heatmap encoder as a reference for the decoder. The heatmap then provides a visual representation of possible targets by mapping predicted probabilities or densities for the trajectory decoder, which handles the uncertainty in multi-modality predictions. Overall, the heatmap illustrates uncertainty and variability, while target nodes provide concrete predictions.
3.4.2. Trajectory Decoder
The trajectory decoder is similar to the target and node heatmap decoder. The difference is that it also needs to gather the information on node and target points together with the branch of the segmentation map and past trajectory heatmaps. Thus, an extra input is added to the U-Net’s expansion arm [17], namely the output of the target and node heatmap decoder. To fit the diverse resolutions, we down-sample the node heatmaps to match the input size required by each decoder layer. After the final per-pixel sigmoid processing, we obtain a probability estimation of the human trajectory in the next
seconds.
3.5. Loss Function
Because trajectories are represented as heatmaps over the scene, we design losses that reflect the discrepancy between the estimated probability distribution and the ground truth (GT). Human motion is inherently goal-directed: humans exert their will through actions to realize a desired effect [28]. Therefore, a Gaussian heatmap can be predetermined from prior probabilistic quantities such as a prior variance, which reduces the unnecessary interference caused by goal uncertainty by exploiting the inherent goal-driven property of humans. GT is modeled as a Gaussian heatmap
, which is centered at the observed locations with a predefined variance. To avoid the misguidance caused by an inappropriate variance, it is chosen adaptively by the human’s last observed position and the sampled multimodal target; the ground truth
also represents the energy distribution according to the thermal diffusion process in Equation (
4). The energy increases with probability and decreases with increasing uncertainty. We use the ground truth to train the Node and Target Decoder by designing the corresponding loss function.
Because pedestrian trajectories are continuous in the 2D scene, the cross-entropy loss, which suits discrete classification, is not appropriate here. Instead, we use the KL divergence, which measures the difference between an estimated probability distribution and the real distribution, to construct the loss function, which consists of the losses between the real distributions
and the estimated distributions
of the target, node, path distributions, and diffusion process, as follows:
Then, we obtain the overall loss function according to the above equations:
where
,
,
, and
control the trade-off between different loss terms.
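A weighted sum of KL-divergence terms of this kind can be sketched as follows. The sketch assumes that each heatmap is normalized to a probability distribution, that the divergence is taken from the real distribution to the estimated one, and that the four trade-off weights are plain scalars; the names `kl_divergence` and `total_loss` and the numeric weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def kl_divergence(p_true, q_pred, eps=1e-12):
    """KL(P || Q) between two heatmaps, each normalized to sum to 1."""
    p = p_true / p_true.sum()
    q = q_pred / q_pred.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def total_loss(pairs, weights):
    """Weighted sum of KL terms, one (real, estimated) pair per loss term."""
    return sum(w * kl_divergence(p, q) for (p, q), w in zip(pairs, weights))

# Toy Gaussian heatmaps: ground truth at (8, 8), estimate shifted to (10, 8).
ys, xs = np.mgrid[0:16, 0:16]
gt = np.exp(-((xs - 8) ** 2 + (ys - 8) ** 2) / 8.0)
est = np.exp(-((xs - 10) ** 2 + (ys - 8) ** 2) / 8.0)

# Four terms (target, node, path, diffusion) with illustrative weights.
pairs = [(gt, est), (gt, est), (gt, est), (gt, gt)]
weights = [1.0, 1.0, 0.5, 0.5]
loss = total_loss(pairs, weights)
print(loss)
```

The divergence is zero when the estimate matches the ground truth and grows as the estimated mass moves away from it, which is the behavior the continuous heatmap setting calls for.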
4. Experiments
To validate the performance of the proposed method, we compare LSN-GTDA with other state-of-the-art methods on two benchmarks: the Stanford Drone Dataset (SDD) and the ETH-UCY dataset. The former is used to evaluate both short-term and long-term predictions in UAV scenarios, while the latter is mainly used to evaluate short-term predictions.
4.1. Datasets
Stanford Drone Dataset (SDD): The dataset includes more than 11,000 independent pedestrians across 20 top–down scenes captured on the Stanford University campus in bird’s eye view using a flying UAV. There are over 40,000 agent–scene interactions in the dataset, and it has been widely used in the trajectory prediction literature in short temporal horizon settings. For short-term prediction, to be consistent with the other baselines, we follow the same setting as the Y-Net [18] baseline, sampling at FPS = 2.5 to obtain an input sequence of 8 frames (3.2 s) and an output of 12 frames (4.8 s). Moreover, for long-term settings, the raw dataset is split in the same fashion as proposed in the TrajNet benchmark [52], evaluating on the same scenes, none of which are seen during the training stage. The long-term setting downsamples the data to FPS = 1 and obtains an input sequence of
s and output of length
s.
ETH-UCY Dataset: Many previous methods such as [21,22] use several sub-datasets from ETH [53] and UCY [54] to train and evaluate their models with the “leave-one-out” strategy; together, these are called the ETH-UCY dataset. It contains 1536 pedestrians in five different scenes, including thousands of non-linear trajectories, with the pedestrians’ positions annotated in meters. In the ETH-UCY experiments, we use the same parameter setting as for SDD. The five sequences, all collected in urban environments, are as follows: (1) ZARA1 is collected in a busy shopping street in Nicosia, Cyprus, and captures pedestrians in a densely populated area, showcasing a variety of walking patterns and interactions. The camera is stationary. (2) ZARA2 is also recorded in Nicosia, in a setting similar to ZARA1 but with different environmental conditions; it adds more variety in terms of the scene and pedestrian density. (3) UNIV records human movement on a university campus in Cyprus, including students and staff moving in and out of university buildings, and exhibits a mix of pedestrian interactions. (4) HOTEL records pedestrians moving towards and away from a hotel in Zurich, highlighting behaviors in a more controlled environment with varying pedestrian densities. It includes various interactions, such as people waiting and entering the hotel. (5) ETH captures pedestrian traffic around the main entrance of the ETH Zurich campus. It features various paths, including crossing streets and moving towards entry points, and shows the interactions and movements of pedestrians.
4.2. Evaluation Metrics
In the experiment, the established Average Displacement Error (ADE) and Final Displacement Error (FDE) are used as the metrics to evaluate forecasting performance. The former is the average $\ell_2$ error between the forecasted future and the ground truth over the entire trajectory, while the latter is the $\ell_2$ error between the forecasted future and the ground truth at the final predicted point [55]. For multiple future predictions, we follow prior works [18,22] and report the final error as the minimum error over all predicted future scenarios. Specifically, ADE and FDE are defined as follows:
where
defines the previous positions during past
frames and the predicted positions during subsequent
frames of the
th pedestrian, which constitutes the predicted trajectory.
N is the number of predicted persons, and
and
are the ground truth and predicted position, respectively.
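As a concrete reading of these metrics, the sketch below computes best-of-K ADE and FDE with NumPy. The array layout, the function name `min_ade_fde`, and the choice to take the minimum over samples separately for each pedestrian and each metric are assumptions made for illustration; they follow common practice rather than a definition confirmed by this text.

```python
import numpy as np


def min_ade_fde(pred, gt):
    """Best-of-K ADE and FDE.

    pred: shape (K, N, T, 2), K sampled futures for N pedestrians over T frames.
    gt:   shape (N, T, 2), ground-truth future positions.
    Returns (ade, fde), each averaged over the N pedestrians after taking
    the minimum over the K samples for every pedestrian.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt[None], axis=-1)  # (K, N, T) Euclidean errors
    ade_per_sample = dist.mean(axis=-1)              # (K, N) mean over frames
    fde_per_sample = dist[..., -1]                   # (K, N) error at last frame
    return ade_per_sample.min(axis=0).mean(), fde_per_sample.min(axis=0).mean()


# Toy case: one pedestrian, two frames, two sampled futures.
gt = np.array([[[0.0, 0.0], [1.0, 0.0]]])
pred = np.stack([
    gt + np.array([0.0, 1.0]),                  # off by 1 at both frames
    gt + np.array([[[0.0, 0.0], [0.0, 0.5]]]),  # exact, then off by 0.5
])
ade, fde = min_ade_fde(pred, gt)
print(ade, fde)  # 0.25 0.5
```

With a single sample (K = 1), the function reduces to the plain ADE and FDE.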
4.3. Short-Term Prediction Results
4.3.1. Results of Stanford Drone Dataset
As shown in
Table 1, we follow the setting of [
18] to make fair comparisons, and the short-term prediction results are presented with the setting of
s and
s. Since randomness is limited in the short-term prediction case, we set
and compare the proposed LSN-GTDA method with other state-of-the-art baselines.
Table 1 shows that our method achieves an ADE of 5.98 and an FDE of 6.80, improving on the previous strong baseline NSP-SFM [35] by 8.3% in ADE and 35.9% in FDE. It also outperforms recent state-of-the-art baselines, e.g., E-
-Net-SC [
42] by 8.6% in ADE and 34.4% in FDE. Previous works [35, 42] are usually limited by weak context awareness and the absence of interaction modeling, whereas the proposed thermal diffusion-based method captures multimodal uncertainty by considering both the trajectories of neighboring nodes and environmental factors.
Compared with published diffusion-based trajectory prediction methods, e.g., MID [60] and LED [45], the results in Table 1 show that the proposed LSN-GTDA outperforms both by a large margin, which supports the proposed global thermal diffusion analysis. Moreover, since LSN-GTDA uses uncertainty factorization and a complete response mechanism rather than a denoising-based diffusion process, it avoids the high time cost of MID. The inference time analysis is given in Section 4.6.
4.3.2. Results of the ETH-UCY Dataset
To validate the generalization of the proposed method, we also report visualization results on ETH-UCY in Figure 5. The predictions in the different ETH-UCY scenarios are consistent with the ground truth and much closer to it than those of the state-of-the-art baseline, i.e., E-
-Net-SC [
42]. Although the scenarios in the ETH-UCY dataset are complex, the proposed LSN-GTDA still captures the motion tendency in diverse environments.
4.4. Long-Term Prediction Results
To evaluate the effect of
and
uncertainty, we propose a long-term trajectory forecasting setting with a stable prediction duration of 60 s in
Figure 6, which is much longer than other prior baselines.
Table 2 reports our results on the Stanford Drone Dataset (SDD) for a time horizon of
s of prediction in the future with
s input.
s means the temporal midway between the observed inputs and the longest estimated goal in
Figure 6. The long-term results of LSN-GTDA are at the
setting for a fair comparison with other methods in
Table 2 and
Figure 6. R-PECNet is trained by a recurrent short-term PECNet model [
20]. On SDD, the proposed LSN-GTDA method also outperforms the state-of-the-art method in the long-horizon setting; e.g., it reduces the ADE/FDE of Y-Net from 47.94/66.71 to 31.91/41.45, a relative improvement of 33.4%/37.9%. Furthermore, LSN-GTDA reduces the ADE/FDE of PECNet by 55.8%/64.9%.
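The percentages quoted for Y-Net follow from the usual relative error reduction, (baseline − ours) / baseline. The short check below reproduces them from the ADE/FDE values given in the text; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error of `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# ADE/FDE values for Y-Net and LSN-GTDA as quoted in the text.
ade_gain = relative_reduction(47.94, 31.91)
fde_gain = relative_reduction(66.71, 41.45)
print(f"ADE: {ade_gain:.1f}%  FDE: {fde_gain:.1f}%")  # ADE: 33.4%  FDE: 37.9%
```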
Moreover, we extend the prediction horizon to one minute and report the results in Figure 6. The ADE of all methods increases as the prediction horizon grows, and both the proposed LSN-GTDA method and Y-Net perform significantly better than PECNet in long-term prediction, which indicates the importance of building target and node models for long-term trajectory prediction. In addition, Figure 6 shows that the proposed LSN-GTDA method outperforms Y-Net.
4.5. Ablation Study of the Proposed Thermal Diffusion Process
To conduct the ablation study of the proposed thermal diffusion process, we evaluate with the ADE and FDE metrics. When several samples are required, such as during evaluation, a straightforward choice is to sample directly from the estimated distribution. However, this approach may ignore sub-optimal candidates, and it may draw samples from redundant adjacent regions or low-probability regions, which leads to multimodal deficiency. Hence, the proposed TMSS is used to evaluate the influence of the number of target samples, which reflects the role of the zero-input response in the diffusion process. In addition, PNMSS is used to evaluate the influence of path multimodality on the already sampled targets, which reflects the role of the zero-state response in the diffusion process.
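For readers less familiar with the terminology, the sketch below illustrates zero-input and zero-state responses in the standard linear-systems sense, where the complete response is their sum. It uses a hypothetical discrete-time system x[k+1] = A x[k] + B u[k] and is only a generic illustration of the decomposition, not the thermal diffusion model of this paper; the matrices, the initial state, and the random input are all assumptions.

```python
import numpy as np


def simulate(A, B, x0, u):
    """Simulate x[k+1] = A x[k] + B u[k] and return the states x[0..T]."""
    x = np.asarray(x0, dtype=float)
    states = [x]
    for u_k in u:
        x = A @ x + B @ u_k
        states.append(x)
    return np.stack(states)


# Hypothetical 2-state system (a damped position/velocity pair), not from the paper.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
x0 = np.array([0.0, 1.0])          # initial state: what the past hands over
rng = np.random.default_rng(0)
u = rng.normal(size=(20, 1))       # random input arriving after time zero

zero_input = simulate(A, B, x0, np.zeros_like(u))  # driven by the initial state only
zero_state = simulate(A, B, np.zeros(2), u)        # driven by the new input only
complete = simulate(A, B, x0, u)                   # both together

# Superposition: the complete response is the sum of the two parts.
assert np.allclose(complete, zero_input + zero_state)
```

In this analogy, the zero-input part depends only on the state accumulated from the past, while the zero-state part depends only on inputs that arrive after prediction starts.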
4.5.1. Analysis of TMSS for Zero-Input Response
To evaluate the performance of TMSS in the proposed thermal diffusion process, we fix
to observe the evolution of
in
Figure 7. For a given
, ADE tends to decrease with the increase in
, indicating that
can help
improve the performance, which validates the necessity of behavioral uncertainty
. For the target multi-modality parameter
given by past historical trajectory states, the diversification of path multi-modality parameter
depending on future random factors can reduce ADE errors to some extent. This phenomenon indicates that the zero-input response sets feasible goals through TMSS according to the tendency of the past trajectory; thus, the above results of variable
validate the finding that TMSS reasonably provides diverse targets for subsequent multimodal path selection in the thermal diffusion process.
Moreover, a fast K-means clustering approach is used to set the number of clusters to
; all cluster centers are generated by K-means and used together with the softargmax-sampled point. In this way, the TMSS strategy controls the sampled points directly and does not need the "truncation trick" proposed in PECNet [
20].
Figure 3 shows the visualization result of the proposed method; both tabular and visual results validate the effectiveness of the proposed complete response-based thermal diffusion mechanism in our method.
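The target sampling idea described in this subsection (cluster centers from K-means plus a softargmax point) can be sketched as follows. This is a rough illustration under several assumptions: the target distribution is taken to be a 2-D probability map, candidate points are drawn from it before clustering, and a hand-written Lloyd's algorithm stands in for the fast K-means variant mentioned in the text. The names `softargmax`, `kmeans`, `sample_targets`, and the parameter `n_draws` are hypothetical.

```python
import numpy as np


def softargmax(prob):
    """Expected (x, y) coordinate under a normalized 2-D probability map."""
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(prob * xs).sum(), (prob * ys).sum()])


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def sample_targets(prob, k, n_draws=2000, seed=0):
    """Draw pixels from `prob`, cluster them into k centers, append the softargmax."""
    prob = prob / prob.sum()
    rng = np.random.default_rng(seed)
    flat = rng.choice(prob.size, size=n_draws, p=prob.ravel())
    ys, xs = np.unravel_index(flat, prob.shape)
    points = np.stack([xs, ys], axis=1).astype(float)
    return np.vstack([kmeans(points, k, seed=seed), softargmax(prob)])


# Toy map with two equally likely goal pixels at (x, y) = (8, 8) and (20, 24).
prob = np.zeros((32, 32))
prob[8, 8] = 1.0
prob[24, 20] = 1.0
targets = sample_targets(prob, k=2)
print(targets)  # two cluster centers followed by the softargmax point
```

Because the cluster count fixes the number of returned targets, the number of goal hypotheses is controlled directly rather than by truncating the sampling distribution.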
4.5.2. Analysis of PNMSS for Zero-State Response
In addition to the analysis of
, to further explore the role of zero-state response in the proposed thermal diffusion process, we also report the results of different
on the SDD dataset to observe its evolution performance in
Figure 7, which shows that ADE decreases as
increases for a given
, and indicates the effective use of multi-modality stochastic parameters.
Specifically, for each curve with a fixed
in
Figure 7, consistent improvement is observed in ADE with increasing
, indicating the effectiveness of the diversity provided by PNMSS. This diversity depends on stochastic factors from new inputs after the zero moment (the start of prediction) and corresponds to the zero-state response generated by the random state input after that moment in the proposed thermal diffusion process. Moreover,
Table 2 reports the LSN-GTDA’s performance compared with different baselines by using different
; we observe that our proposed method improves ADE significantly on the SDD dataset, which indicates that path multimodality can reduce the error when estimating the same final target. Hence, the results in
Table 2 also validate the finding that a moderate gain of path modality (
) can reduce prediction errors in long-term cases.
To further evaluate the components of the thermal diffusion process in our method, we conduct an additional ablation study on the SDD benchmark. The results shown in
Table 3 validate the effectiveness of long-term prediction with both the TMSS and PNMSS strategies. “None (LSN-GTDA)” indicates that LSN-GTDA does not use the thermal diffusion process, while “TMSS (LSN-GTDA)”, “PNMSS (LSN-GTDA)”, and “TMSS+PNMSS (LSN-GTDA)” denote the use of TMSS alone, PNMSS alone, and the complete thermal diffusion process, respectively. The results in
Table 3 show that the complete process performs much better than the other variants in long-term cases, which demonstrates the effectiveness of the proposed global thermal diffusion analysis for behavioral and stochastic factors in pedestrian trajectory prediction.
Additionally,
Figure 7 illustrates the impact of the diversity contributed by both forms of multimodality for various choices of
and
; the results show consistent ADE improvements for various
while increasing
, which indicates the effectiveness of multimodality. According to
Table 3, the “None (LSN-GTDA)” baseline, which lacks any multimodality strategy, performs much worse than the variants that use at least one. Decomposing multimodality into path and target components aligns them with the stochastic and behavioral factors of the uncertainty, respectively. The degree of uncertainty is thereby reduced, and so is the prediction error, which highlights the importance of factorizing target and path multimodality for diverse future trajectory modeling.
4.6. Parameter Workload Analysis
In addition,
Table 4 shows the inference time and parameter workload of state-of-the-art deep learning-based models. Our LSN-GTDA method is lighter than the other methods except Y-Net, whose parameter count it exceeds by only about 13.4% because LSN-GTDA has a symmetrical network. However, in the training stage, the proposed LSN-GTDA algorithm converges after 50 epochs, while the Y-Net baseline needs 200 epochs to converge. Therefore, compared with the Y-Net baseline, LSN-GTDA reduces the number of epochs needed for convergence by 75% while increasing the number of model parameters by only 13.4%. As shown in
Table 4, our method achieves low-latency trajectory prediction in the training stage, which indicates that LSN-GTDA meets the real-time forecasting goal by using the complete response mechanism-based thermal diffusion process rather than traditional diffusion models that rely on time-consuming denoising. In the test stage, LSN-GTDA takes 23.86 s for each long-term scene in SDD and also obtains better results than Y-Net, which shows the superiority of our method.
4.7. Conclusions and Discussion
In this paper, we propose a human trajectory forecasting method based on a novel mechanism of global energy analysis, which models prediction uncertainty arising from the behavioral factor and the stochastic factor through the zero-input response and the zero-state response, respectively. Our method reveals the inherent principle behind the multiple modalities of human trajectories and provides a way to learn the principles of human motion from UAV data.
For downstream applications, multimodal human trajectory prediction can be applied to a wide range of fields involving human–robot interaction. By incorporating uncertainty modeling, accurate and reliable future trajectories can be generated, which is crucial for supporting autonomous driving decisions. Furthermore, the proposed LSN-GTDA method offers interpretability in addressing uncertainty, making it a promising solution for more dynamic and interactive domains, such as human–computer interaction, action recognition, UAV-based object tracking, and person re-identification.
Extending human trajectory prediction from UAV scenarios to pedestrian tracking is feasible and can be beneficial. The two tasks share similar dynamics of human movement, and sensor fusion may allow UAV data to enhance tracking models. However, there are also challenges, such as environmental differences: UAVs usually operate in open areas, while pedestrians often encounter obstacles, varying terrain, and a higher density of objects, which may cause target loss or mismatches in general tracking algorithms, e.g., SORT and ByteTrack. One feasible solution is to design a person re-identification algorithm for broad UAV scenarios that matches the pedestrians detected in subsequent frames after the tracking target is lost, which may allow tracking to recover quickly.
In the future, we will extend the proposed method to scenarios with several pedestrians, such as a group, and try to improve it using related work on person and group re-identification [
63,
64] under UAV scenarios.