The goal of the proposed method is to detect human poses in lidar frames recorded by an NRCS lidar sensor. The proposed method is an end-to-end solution for moving person detection and pose estimation in a surveillance use case, where the NRCS lidar sensor is placed in a fixed position. The human pose is represented by an ordered list of anatomical keypoints, referred to as joints hereinafter.
The human pose estimation task can be applied in surveillance applications, which demand real-time solutions. To address this need, our approach transforms the representation of the NRCS lidar point cloud from 3D Cartesian coordinates to a spherical polar coordinate system, similar to our previous works in [39,40]. We generate a 2D pixel grid by discretizing the horizontal and vertical FoVs, where each 3D point's distance from the sensor is mapped to a pixel determined by the corresponding azimuth and elevation values. The azimuth and elevation angles correspond to the horizontal and vertical pixel coordinates, respectively, while the distance is encoded as the 'gray' value of the respective pixel. This process allows the subsequent steps of our proposed lidar-only 3D human pose estimation method to be developed within the domain of 2D range images.
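To make the projection concrete, the following is a minimal sketch of the angular discretization, assuming a NumPy point cloud; the FoV limits and image size are illustrative placeholders, not the exact parameters used in our experiments.

```python
import numpy as np

def points_to_range_image(points_xyz, width, height,
                          h_fov=(-35.2, 35.2), v_fov=(-38.6, 38.6)):
    """Project 3D lidar points (N, 3) onto a 2D range image.

    The FoV limits and image dimensions are illustrative assumptions.
    """
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    r = np.linalg.norm(points_xyz, axis=1)                       # distance from the sensor
    azimuth = np.degrees(np.arctan2(y, x))                       # horizontal angle
    elevation = np.degrees(np.arcsin(z / np.maximum(r, 1e-9)))   # vertical angle

    # Discretize the angles into pixel coordinates.
    u = ((azimuth - h_fov[0]) / (h_fov[1] - h_fov[0]) * (width - 1)).round().astype(int)
    v = ((elevation - v_fov[0]) / (v_fov[1] - v_fov[0]) * (height - 1)).round().astype(int)

    # Keep only points that fall inside the FoV.
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    # Pixels never hit by the scanning pattern keep an 'undefined' range (0 here).
    range_image = np.zeros((height, width), dtype=np.float32)
    range_image[v[valid], u[valid]] = r[valid]
    return range_image
```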
Depending on the timing window of data collection, as illustrated in Figure 1, the range image of a specific lidar frame may contain numerous pixels with undefined range values due to the NRCS scanning pattern. The number of these undefined pixels depends on both the measurement integration time and the predefined dimensions of the range image. In our experiments, we leveraged the precision parameters of the Livox Avia sensor, mapping its FoV onto a pixel grid, resulting in a spatial resolution of pixels per degree. It is important to note that the density of the recorded valid-range values decreases towards the periphery of the range image due to the scanning technique: the scanning pattern crosses the sensor's optical center more frequently than it covers the perimeter regions of the FoV. This 2D range image-based data representation facilitated the efficient and robust utilization of sparse lidar data.
The proposed method builds on the state-of-the-art ViTPose [28] human pose estimation method, which operates on camera images, relies on a vision transformer (ViT) architecture [23], and was trained on the COCO dataset [41].
2.2. LidPose
The proposed LidPose method is an end-to-end solution which solves the human detection and pose estimation task using only NRCS lidar measurements in a surveillance scenario, where the sensor is mounted in a fixed position. The LidPose method's workflow is shown in Figure 3.
First, the moving objects are separated from the static scene regions in the NRCS lidar measurement sequence by applying a foreground–background segmentation technique based on the mixture-of-Gaussians (MoG) approach adopted in the range image domain, as described in [39]. A local background (Bg) model is maintained for each pixel of the range image, following the MoG approach [42] applied to the range values. Due to the sparsity of the captured point clouds, within a given time frame only the MoG background model components of range image pixels corresponding to the actual measurement points are updated. The incoming measurement points are then classified as either foreground or background by matching the measured range values to the local MoG distributions.
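As an illustration of this step, the sketch below maintains a single Gaussian per pixel over range values; it is a simplified stand-in for the full mixture model of [42], and the parameter values are assumptions for demonstration only.

```python
import numpy as np

class RangeBackgroundModel:
    """Simplified per-pixel Gaussian background model over range values.

    A single-Gaussian stand-in for the mixture-of-Gaussians model: each pixel
    keeps a running mean and variance of its range, and only pixels that
    receive a valid measurement in the current frame are updated.
    """

    def __init__(self, shape, alpha=0.02, k_sigma=2.5, init_var=0.25):
        self.mean = np.zeros(shape, dtype=np.float32)
        self.var = np.full(shape, init_var, dtype=np.float32)
        self.initialized = np.zeros(shape, dtype=bool)
        self.alpha = alpha          # learning rate of the running update
        self.k_sigma = k_sigma      # match threshold in standard deviations

    def apply(self, range_image):
        """Classify valid pixels as foreground/background and update the model."""
        valid = range_image > 0                       # pixels hit by the scan
        diff = np.abs(range_image - self.mean)
        matched = valid & self.initialized & (diff <= self.k_sigma * np.sqrt(self.var))

        # Foreground: valid, already-modeled pixels that do not match the background.
        foreground = valid & self.initialized & ~matched

        # The first valid observation of a pixel initializes its background model.
        first = valid & ~self.initialized
        self.mean[first] = range_image[first]
        self.initialized |= first

        # Running update of the mean and variance where the background matched.
        self.mean[matched] += self.alpha * (range_image[matched] - self.mean[matched])
        self.var[matched] += self.alpha * (diff[matched] ** 2 - self.var[matched])
        return foreground
```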
Second, the foreground point regions are segmented to separate individual moving objects, and the footprint positions of the detected pedestrian candidates are estimated. Here, a 2D lattice is fitted to the ground plane, and the foreground regions are projected onto the ground. In each cell of the ground lattice, the number of projected foreground points is counted; these counts are then used to extract the foot positions, as described in [43]. The result of this step is a set of bounding boxes for the detected people, which can be represented both in the 3D space and in the 2D range image domain. As shown in [43], the exploitation of direct range measurements makes the separation of partially occluded pedestrians highly accurate; however, the efficiency of the approach can deteriorate in a large crowd.
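The following sketch illustrates the ground-lattice occupancy idea: foreground points are projected onto a horizontal grid, per-cell counts are accumulated, and cells whose counts exceed a threshold are kept as footprint candidates. The cell size, count threshold, and lattice extent are assumed values, not the parameters of [43].

```python
import numpy as np

def footprint_candidates(fg_points_xyz, cell_size=0.2, min_count=30,
                         x_range=(-20.0, 20.0), y_range=(0.0, 40.0)):
    """Project foreground points onto a ground lattice and count per-cell hits.

    Returns the occupancy grid and the (x, y) centers of candidate cells.
    """
    x, y = fg_points_xyz[:, 0], fg_points_xyz[:, 1]   # ground-plane coordinates
    nx = int((x_range[1] - x_range[0]) / cell_size)
    ny = int((y_range[1] - y_range[0]) / cell_size)

    ix = ((x - x_range[0]) / cell_size).astype(int)
    iy = ((y - y_range[0]) / cell_size).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    grid = np.zeros((ny, nx), dtype=np.int32)
    np.add.at(grid, (iy[inside], ix[inside]), 1)      # per-cell point counts

    # Cells with enough projected points become footprint candidates.
    cand_iy, cand_ix = np.nonzero(grid >= min_count)
    centers = np.stack([x_range[0] + (cand_ix + 0.5) * cell_size,
                        y_range[0] + (cand_iy + 0.5) * cell_size], axis=1)
    return grid, centers
```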
In the next step, the NRCS lidar point cloud and the range image are cropped with the determined bounding boxes. The cropped regions correspond to lidar measurement segments containing points either from a person or from the ground under their feet.
To jointly represent the different available measurement modalities, we propose a new 2D data structure that can be derived straightforwardly from the raw lidar measurements and can be efficiently used to train and test our proposed LidPose model. More specifically, we construct from the input point cloud a five-channel image over the lidar sensor's 2D range image lattice, where two channels directly contain the depth (D) and intensity (I) values of the lidar measurements, while the remaining three channels contain the X, Y, Z coordinates of the associated lidar points in the 3D world coordinate system.
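Building on the projection sketch above, a five-channel sample can be assembled as follows; the channel ordering and the `project_fn` helper (which is assumed to return pixel coordinates and a validity mask) are illustrative assumptions.

```python
import numpy as np

def build_five_channel_image(points_xyz, intensity, width, height, project_fn):
    """Assemble the five-channel (X, Y, Z, D, I) image on the range image lattice.

    `project_fn` is a hypothetical helper mapping points to (u, v) pixel
    coordinates and a validity mask, e.g. via the angular discretization
    shown earlier.
    """
    u, v, valid = project_fn(points_xyz, width, height)
    img = np.zeros((5, height, width), dtype=np.float32)

    r = np.linalg.norm(points_xyz, axis=1)               # depth channel values
    img[0:3, v[valid], u[valid]] = points_xyz[valid].T   # X, Y, Z channels
    img[3, v[valid], u[valid]] = r[valid]                # D (depth) channel
    img[4, v[valid], u[valid]] = intensity[valid]        # I (intensity) channel
    return img
```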
Note that in our model, the pose estimator part of the method is independent of the sensor placement. While in this paper we demonstrate the application purely in a static lidar sensor setup, we should mention that with an appropriate segmentation method for a given scene, the LidPose pose estimation step could also be adapted to various—even moving—sensor configurations.
To comprehensively explore and analyze the potential of using NRCS lidar data for the human pose estimation task, we introduce and evaluate three alternative model variants:
LidPose–2D predicts the human poses in the 2D domain, i.e., it detects the projections of the joints (skeleton keypoints) onto the pixel lattice of the range images, as shown in Figure 4a. While this approach can lead to robust 2D pose detection, it does not predict the depth information of the joint positions.
LidPose–2D+ extends the result of the LidPose–2D prediction to 3D for those joints where valid values exist in the range image representation of the lidar point cloud, as shown in Figure 4b. This serves as the baseline of the 3D prediction, with the limitation that, due to the sparsity of the lidar range measurements, some joints will not be associated with valid depth values (marked by blue boxes in Figure 4b).
LidPose–3D is the extended version of LidPose–2D+, where depth values are estimated for all joints based on a training step. This approach predicts the 3D human poses in the world coordinate system from the sparse input lidar point cloud, as shown in Figure 4c.
The ViTPose [28] network structure was used as a starting point in the research and development of the proposed LidPose methods' pose estimation networks. Our main contributions to the proposed LidPose method are as follows:
A new patch-embedding implementation was applied to the network backbone to handle the different input channel counts efficiently and dynamically (see the sketch after this list).
The number of transformer blocks used in the LidPose backbone was increased to enhance the network's generalization capabilities through a larger parameter count.
The output head of the LidPose–3D configuration was extended so that joint depths are predicted alongside the 2D joint predictions.
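As a minimal sketch of the first contribution, the patch-embedding layer below accepts an arbitrary number of input channels; the patch size, embedding dimension, and crop size are illustrative assumptions, not the exact LidPose hyperparameters.

```python
import torch
import torch.nn as nn

class FlexiblePatchEmbed(nn.Module):
    """ViT-style patch embedding with a configurable input channel count.

    `in_channels` can be set to 1 (D or I), 3 (XYZ), or 5 (XYZ + D + I),
    matching whichever channel configuration is fed to the backbone.
    """

    def __init__(self, in_channels=5, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

# Example: a single five-channel crop of assumed size 256x192.
tokens = FlexiblePatchEmbed(in_channels=5)(torch.randn(1, 5, 256, 192))
```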
As Figure 3 demonstrates, the LidPose network structure can deal with different input and output configurations depending on the considered channels of the above-defined five-channel image structure. The optimal channel configuration is a hyperparameter of the method that can be selected upon experimental evaluation, as described in detail in Section 4. In our studies, we tested the LidPose networks with five different input data configurations, formed as combinations of the XYZ, D, and I channels.
For the training and testing of the proposed method, a new dataset was introduced, comprising an NRCS lidar point cloud segment and the co-registered human pose ground truth (GT) information for each sample object. The dataset is described in detail in Section 3. The three model variants introduced above are detailed in the following subsections.
2.2.1. LidPose–2D
For pose estimation in the 2D domain, the LidPose–2D network was created based on the ViTPose [28] architecture. The patch-embedding module of the ViTPose backbone was changed to handle custom input dimensions for the different channel configurations (XYZ, D, I, and their combinations).
This new network architecture was trained end-to-end from scratch, starting from randomly initialized weights. Five different networks were trained for the input combinations described above. For these networks predicting 2D joint positions, the training losses were calculated in the joint heatmap domain. An example of the LidPose–2D prediction can be seen in Figure 4a.
2.2.2. LidPose–2D+
In this model variant, called LidPose–2D+, the 2D predictions created by the LidPose–2D configuration are straightforwardly extended to the 3D space. Each predicted 2D joint is checked, and if a valid depth measurement exists around the joint's pixel location in the lidar range image, the 3D position of the joint is calculated from its 2D pixel position and the directly measured depth value. This transfer from the 2D space to the 3D space provides a simple baseline for 3D pose prediction. However, the LidPose–2D+ approach has a serious limitation originating from the inherent sparsity of the NRCS lidar point cloud: 2D joints located in regions with missing depth measurements in the 2D range image cannot be extended to 3D. An example of a LidPose–2D+ prediction is shown in Figure 4b, highlighting three joints that cannot be assigned to range measurements.
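A minimal sketch of this lifting step is given below; the search window size, the median-based depth selection, and the `pixel_to_direction` back-projection helper are assumptions for illustration, not the exact LidPose–2D+ procedure.

```python
import numpy as np

def lift_joint_to_3d(joint_uv, range_image, pixel_to_direction, window=2):
    """Lift a 2D joint to 3D using nearby valid range measurements.

    `pixel_to_direction(u, v)` is assumed to return the unit viewing direction
    of a pixel (the inverse of the angular discretization of the range image).
    Returns a 3D point, or None when no valid depth exists near the joint.
    """
    u, v = int(round(joint_uv[0])), int(round(joint_uv[1]))
    h, w = range_image.shape
    patch = range_image[max(0, v - window):min(h, v + window + 1),
                        max(0, u - window):min(w, u + window + 1)]

    valid = patch[patch > 0]                 # undefined pixels are stored as 0
    if valid.size == 0:
        return None                          # joint cannot be extended to 3D

    depth = float(np.median(valid))          # robust local depth estimate
    return depth * pixel_to_direction(u, v)  # back-project along the view ray
```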
2.2.3. LidPose–3D
The limitations of LidPose–2D+ can be eliminated by a new network called LidPose–3D, which aims to predict the depth of each detected joint in addition to its pixel position on the range image lattice. Similarly to the LidPose–2D variants described above, this network structure can handle inputs with different configurations of the XYZ, D, and I channels.
The LidPose–3D network's output is constructed as an extension of ViTPose [28] to predict depth values for the joints alongside their 2D coordinates. The normalized depth predictions are performed on a single-channel 2D depth image in the same down-scaled image space where the joint heatmaps are predicted. An example of a LidPose–3D prediction can be seen in Figure 4c.
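The sketch below shows one way such an extended output head could look: a deconvolution-based decoder, assumed here to mirror the ViTPose heatmap head, produces the K joint heatmaps plus one extra channel for the normalized depth map. The layer sizes and joint count are assumptions, not the exact LidPose–3D configuration.

```python
import torch
import torch.nn as nn

class JointAndDepthHead(nn.Module):
    """Decoder head predicting K joint heatmaps plus a single depth channel.

    Outputs live in the down-scaled heatmap space: channels 0..K-1 are the
    joint heatmaps, channel K is the normalized depth map.
    """

    def __init__(self, embed_dim=768, num_joints=17):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_joints + 1, kernel_size=1),  # heatmaps + depth
        )

    def forward(self, feat):                  # feat: (B, embed_dim, H/16, W/16)
        out = self.decoder(feat)              # (B, K + 1, H/4, W/4)
        heatmaps, depth = out[:, :-1], out[:, -1:]
        return heatmaps, depth
```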
2.3. LidPose Training
The training input is a 2D array with a given number of channels, depending on the training configuration (combinations of XYZ, D, I). For the different channel configurations, different patch-embedding modules were defined to accommodate the variable number of input channels, as shown in Figure 3. For the training and evaluation of the network, we also need the ground truth pose data, which we assume to be available at this point (details of ground truth generation are presented in Section 3).
Regarding the loss function of the LidPose–2D network, we followed the ViTPose [28] approach by using the mean squared error (MSE) between the predicted and the ground truth heatmaps:
$$ L_{\mathrm{2D}} = \frac{1}{K}\sum_{k=1}^{K} \left\lVert \hat{H}_k - H_k \right\rVert_2^2, \qquad (1) $$
where $\hat{H}_k$ and $H_k$ are the predicted joint heatmap and the ground truth joint heatmap of the $k$-th joint, respectively.
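A minimal PyTorch sketch of this loss is shown below, assuming Gaussian ground truth heatmaps rendered around the annotated joint pixels; the Gaussian sigma is an assumed value.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap(h, w, center_uv, sigma=2.0):
    """Render a ground truth heatmap as a Gaussian around a joint pixel."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    d2 = (xs - center_uv[0]) ** 2 + (ys - center_uv[1]) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def heatmap_mse_loss(pred_heatmaps, gt_heatmaps):
    """Equation (1): mean squared error between predicted and GT heatmaps.

    Both tensors have shape (B, K, H, W), one heatmap per joint.
    """
    return F.mse_loss(pred_heatmaps, gt_heatmaps)
```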
For the LidPose–3D network, the training loss is composed of two components: one responsible for the joints' 2D prediction accuracy ($L_{\mathrm{pos}}$), the other reflecting the depth estimation accuracy ($L_{\mathrm{depth}}$). The total training loss is a weighted sum of the position and depth losses:
$$ L_{\mathrm{3D}} = \lambda_{\mathrm{pos}} L_{\mathrm{pos}} + \lambda_{\mathrm{depth}} L_{\mathrm{depth}}. $$
For calculating the 2D joint position loss term $L_{\mathrm{pos}}$, Equation (1) was used again. Regarding the depth loss $L_{\mathrm{depth}}$, we tested three different formulas: the L1 loss, the L2 loss, and the structural similarity index measure (SSIM) [44]. Based on our evaluations and considering training runtime, the SSIM was selected as the depth loss measure in the proposed LidPose–3D network. The weighting coefficients $\lambda_{\mathrm{pos}}$ and $\lambda_{\mathrm{depth}}$ in the loss function were set following a grid search optimization.
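A sketch of this combined loss is given below; it assumes the third-party `pytorch_msssim` package for the SSIM term and uses placeholder weights, since the coefficients obtained by the grid search are not restated here.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def lidpose3d_loss(pred_heatmaps, gt_heatmaps, pred_depth, gt_depth,
                   w_pos=1.0, w_depth=1.0):
    """Weighted sum of the 2D heatmap loss and the SSIM-based depth loss.

    pred_depth and gt_depth are single-channel, normalized depth maps in the
    down-scaled heatmap space; the weights are placeholders, not the values
    obtained by the grid search in the paper.
    """
    loss_pos = F.mse_loss(pred_heatmaps, gt_heatmaps)          # Equation (1)

    # SSIM is a similarity measure (1 for identical maps); use (1 - SSIM) as a loss.
    loss_depth = 1.0 - ssim(pred_depth, gt_depth, data_range=1.0)

    return w_pos * loss_pos + w_depth * loss_depth
```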