1 Introduction

Estimating the body pose and shape of a person lying on a patient table or hospital bed, referred to as patient body modeling in this write-up, has a broad range of applications in healthcare [6, 13]. Examples include long-term monitoring to track patient movement during epileptic seizures [2], radiation therapy assistance [5], and scanning workflow improvement [15]. Due to its wide-ranging applications in multimedia, safety and surveillance, as well as diagnostic and therapeutic healthcare, human body pose and shape estimation has been widely studied [7, 14, 17]. Shotton et al. [14] use random forests to obtain a pixel-wise labeling of the depth image, followed by regression to estimate the 3D human skeleton model. Weiss et al. [17] use SCAPE [3] to fit a 3D deformable human mesh to depth data; recently, this approach was greatly enhanced by Bogo et al. [7] to estimate body shape and pose parameters from an RGB image and several landmarks on the body surface. These body landmarks can be detected using deep convolutional networks [4] or deep reinforcement learning [9].

In spite of the aforementioned breakthroughs in human body pose and shape analysis, patient body modeling remains challenging. The challenges lie mainly in the demanding accuracy requirements of healthcare applications, the shape of a lying-down person, and the fact that patients are often under loose covers such as patient gowns or hospital blankets. Grimm et al. [10], Singh et al. [15], and Achilles et al. [2] are the works closest to addressing these challenges. Nevertheless, none of these approaches estimates and evaluates a detailed surface mesh on real patient data or addresses the challenges involved in estimating such a mesh in the patient body modeling setting. A solution that addresses these issues would not only improve landmark accuracy for patient monitoring [6, 15] but could also enable novel scanning use cases.

In this paper, we present DARWIN, a method to estimate a detailed body surface mesh of a covered patient lying on a table or a bed from a single snapshot of a range sensor. The proposed algorithm detects patient landmarks under clothing or blanket cover using deep convolutional networks trained on a large dataset of real patients and volunteers. The algorithm then employs a deformable human mesh model [3] learned from patient meshes. Our contributions can be summarized as follows: (1) robust pose and landmark detection under clothing by training on real patient datasets; (2) learning a deformable human mesh model adapted for accurate patient body modeling; (3) training and evaluation on data from more than 1000 human subjects. Evaluation comparing the estimated patient mesh with the skin surface extracted from CT data shows promising results.

2 Method

In this section, we present our approach, DARWIN (Deformable patient Avatar Representation With deep Image Network), to model a lying-down patient under covers. DARWIN estimates the patient mesh from a single snapshot from a range imaging device, such as a Microsoft Kinect 2 or ASUS Xtion, mounted on the ceiling and pointing towards the table or hospital bed. We employ a coarse-to-fine workflow, inspired by [7, 15, 17], estimating finer details of the patient geometry as each module processes the data. The workflow starts by classifying the patient pose with a deep image network into head-first or feet-first and prone or supine based on the orientation of the patient. Next, 15 landmarks on the patient surface are detected; together they are sufficient to define a coarse skeletal structure of the human body. Finally, we fit a learned patient-centric deformable mesh model to the 3D measurements of the patient surface. In the following sections, we provide details for each processing step.

2.1 Body Pose Classification and Landmark Detection

We first compute the 3D point cloud from the depth data using the calibration information and crop a 3 m (along table length) \(\times\) 1.5 m (along table width) \(\times\) 2 m (along table normal) region containing the table and the patient. Next, we project the point cloud orthographically along the table axes to obtain a 2.5D depth image; this is similar to the bed-aligned maps [10] or reprojected depth maps [15]; for simplicity, we refer to these as depth feature maps.
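As a rough illustration of this preprocessing step, the sketch below crops the point cloud to the stated box and rasterizes it into an orthographic depth feature map. The table-aligned coordinate frame, the 5 mm grid resolution, and keeping the highest point per cell are our assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code) of cropping the table region and
# orthographically projecting the point cloud into a depth feature map.
import numpy as np

def depth_feature_map(points_table, extent=(3.0, 1.5, 2.0), resolution=0.005):
    """points_table: (N, 3) points already expressed in the table frame
    (x along table length, y along table width, z along the table normal)."""
    lx, ly, lz = extent
    # Keep only points inside the 3 m x 1.5 m x 2 m crop box.
    mask = ((np.abs(points_table[:, 0]) <= lx / 2) &
            (np.abs(points_table[:, 1]) <= ly / 2) &
            (points_table[:, 2] >= 0) & (points_table[:, 2] <= lz))
    pts = points_table[mask]

    h, w = int(lx / resolution), int(ly / resolution)
    depth_map = np.zeros((h, w), dtype=np.float32)
    rows = ((pts[:, 0] + lx / 2) / resolution).astype(int).clip(0, h - 1)
    cols = ((pts[:, 1] + ly / 2) / resolution).astype(int).clip(0, w - 1)
    # Keep the highest point per cell, i.e. the surface closest to the camera.
    np.maximum.at(depth_map, (rows, cols), pts[:, 2])
    return depth_map
```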

For pose classification, we employ a convolutional neural network (CNN) that classifies a given depth feature map as head-first prone, feet-first prone, head-first supine, or feet-first supine. Our classification network consists of 4 convolutional and 3 fully-connected layers. Each convolutional layer has 64 \([5\times 7]\) filters and is followed by a \([2\times 2]\) max-pooling layer. The convolutional section is followed by 3 fully-connected layers with 64, 32, and 4 nodes, respectively; each layer uses Rectified Linear Units (ReLUs). The network is trained by minimizing the categorical cross-entropy using Adaptive Moment Estimation (Adam) [11]. To avoid over-fitting, we augment the training data with horizontal and vertical flipping and add regularization using spatial dropout [16] before the convolutional layers and regular dropout before the fully-connected layers. Given the patient pose, we next detect the locations of 15 body markers: head top/bottom, shoulders, elbows, wrists, torso, groin, knees, and ankles. Accurate localization of these markers is important since they are used to initialize the patient surface mesh in the next stage.
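A minimal sketch of a classification network with the layer counts and filter sizes described above is given below, written with Keras for concreteness; the input resolution, dropout rates, and softmax output layer are assumptions, since these details are not specified in the paper.

```python
# Sketch of the 4-class pose classifier: 4 conv layers with 64 [5x7] filters,
# each followed by 2x2 max pooling, then FC layers of 64/32/4 units.
from tensorflow.keras import layers, models, optimizers

def build_pose_classifier(input_shape=(256, 128, 1)):   # input size is an assumption
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for _ in range(4):
        x = layers.SpatialDropout2D(0.1)(x)              # spatial dropout before each conv layer
        x = layers.Conv2D(64, (5, 7), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)                           # regular dropout before the FC layers
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(4, activation="softmax")(x)   # head/feet-first x prone/supine
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```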

Fig. 1. Fully convolutional landmark localization network.

We employ an ensemble of SegNet-based [4] fully convolutional networks (FCNs) to achieve efficient and accurate body marker localization. Each model in our ensemble has a "flat architecture" (see Fig. 1 for an example). Depending on the model, each convolutional layer has either 26 or 32 filters, and all filters have the same \([7\times 7]\) dimensions. Each model has either four or five encoding/decoding stages, and each stage consists of three consecutive convolutional layers. Spatial dropout is used before the output layer to avoid over-fitting. The output layer is a convolutional layer with 15 \([1\times 1]\) filters, making this an efficient FCN. Each network is trained by minimizing the mean-squared error between the predicted marker-specific heatmaps and the ground truth heatmaps. The optimization uses ADAGRAD [8]. Once the models are trained, we combine them in a simple additive ensemble based on the validation dataset. We observed that combining individual models in an ensemble significantly reduced the number of outlier detections and made the localization more robust.
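One simple way to realize the additive ensemble and read out marker locations from the predicted heatmaps is sketched below; extracting each coordinate as the per-channel argmax of the summed heatmaps is our assumption of how the final locations are obtained.

```python
# Sketch: combine per-model heatmap stacks additively, then take the peak of
# each marker-specific channel as that marker's pixel location.
import numpy as np

def ensemble_landmarks(models, depth_map):
    """models: list of callables mapping an (H, W) depth feature map to an
    (H, W, 15) stack of marker-specific heatmaps."""
    combined = sum(m(depth_map) for m in models)        # simple additive ensemble
    h, w, n_markers = combined.shape
    coords = np.zeros((n_markers, 2), dtype=int)
    for k in range(n_markers):
        flat_idx = np.argmax(combined[:, :, k])
        coords[k] = np.unravel_index(flat_idx, (h, w))  # (row, col) of heatmap peak
    return coords
```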

Addressing Clothing Cover: For robust detection even under clothing cover, we trained the deep network on real patient datasets collected from multiple sites with the significant clothing variations typically observed during medical scans, ranging from casual clothing to patient gowns and covers. For consistent annotation across patients, we utilized data acquired from multiple modalities. Besides RGB and depth images, we also presented the annotators with surface normal images coded as RGB images as well as the aligned medical data (topogram). Figure 2(a) shows aligned images from various modalities for a patient. This helped us acquire more accurate and consistent landmarks across the patient body and resulted in more robust detectors.

Besides data from real patients, we also collected data in a lab environment. For each volunteer, we collected data at various table positions and in various patient poses, and for each acquisition, we captured data both with and without a soft blanket cover, without moving the subject. Figure 2(b) shows one such pair of images from the acquired dataset. To minimize annotation biases, we placed color markers at the landmark positions; these were visible in the color camera but not in the depth camera. For images with clothing cover, we reuse the annotations of the corresponding data (of the same person in the same pose at the same table position) without the cover. Augmenting the real patient data with the lab data significantly helped the network training, especially in handling clothing cover as well as other biases such as patient pose.

Fig. 2. Data used for training body markers. (a) Data from multiple modalities aligned with the corresponding depth feature map to aid the annotation process. (b) Depth feature maps of a person with and without clothing cover.

2.2 Patient Centric 3D Body Shape Estimation

Given the locations of the landmarks, we reconstruct the dense 3D patient body surface, represented as a polygon mesh. The reconstructed 3D model is obtained using a parametrized deformable mesh (SCAPE [3]), which can be efficiently perturbed to a target body pose and shape. SCAPE simplifies the deformation model by decoupling pose and shape perturbations and, during inference, optimizes the parameters in an iterative framework. We adapt this model for accurate patient mesh modeling in two ways: first, we learn a shape deformation model for a patient body lying on a table, and second, we use a coarse-to-fine registration method to identify correspondences between the deformed template mesh and the 3D surface data.
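The alternating structure of such an iterative fit can be sketched as follows. Here `deform` and `refine_parameters` are placeholders standing in for the SCAPE deformation model and its parameter update, and the nearest-neighbour correspondence search is only one plausible realization of the coarse-to-fine registration, not the authors' exact procedure.

```python
# Conceptual sketch of an alternating fit loop: find correspondences between
# the current deformed template and the measured 3D points, then refine the
# pose/shape parameters against those correspondences.
from scipy.spatial import cKDTree

def fit_mesh(template_vertices, deform, refine_parameters, target_points,
             params, n_iters=10):
    for _ in range(n_iters):
        verts = deform(template_vertices, params)        # current deformed template mesh
        _, idx = cKDTree(target_points).query(verts)     # closest measured point per vertex
        params = refine_parameters(params, verts, target_points[idx])
        # A coarse-to-fine schedule can be realized by gradually tightening the
        # distance threshold used to accept correspondences between iterations.
    return params
```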

Learning Patient Shape Deformation. To learn a deformation model for the patient mesh, we must first obtain full body meshes from several patients; furthermore, these meshes must be registered, i.e., have point-to-point correspondence. Since such registered full body patient meshes are difficult to obtain, we use human body scans from various data sources and modify them to obtain patient-like body meshes. We first learn the SCAPE deformation model (pose and shape) using a dataset of 3D human body scans generated by perturbing the parameters of a character in POSER [1]. This learned model captures human body variations but fails to capture the necessary body shape details. To this end, we fit the learned SCAPE model to more than 1000 detailed 3D human body scans from the CAESAR dataset [12], which includes human subjects with significant shape variations wearing tight clothing. We use the registered CAESAR-fitted meshes to retrain the SCAPE model, which enables it to model more realistic and detailed body shape variations. However, the shape deformation model is still trained on standing subjects and thus fails to capture the deformation of the body shape when a subject is lying on the table. We address this by simulating the placement of the CAESAR-fitted meshes on a table surface mesh obtained from its CAD data; since the fitted meshes are registered to the learned SCAPE model, we use the SCAPE model to change the skeletal pose of the mesh to simulate the lying-down pose and deform the back surface (from neck to hip) of the mesh such that it is flat, thereby simulating the effect of gravity on the loose body tissue. Next, we use the "gravity"-simulated meshes to retrain the shape model of SCAPE. While this simulation handles the back surface well, the deformation of other soft tissues may still not be captured; to this end, we collected several depth and CT image pairs with varying patient shapes and a large CT field of view (e.g., over the thorax and abdomen) and fit the "gravity"-simulated SCAPE model to both the depth surface and the CT skin mesh of the patient. Finally, we learned the shape deformation model by augmenting the "gravity"-simulated dataset with the meshes fitted jointly to depth and CT. Due to the scarcity of depth-CT image pairs with the necessary shape variation and field of view, the "gravity"-simulated training was necessary to learn a good shape deformation model. Figure 3 illustrates the data generated for training the lying-down person shape model.
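To make the "gravity" simulation step concrete, a highly simplified version of flattening the back surface of a lying-down mesh toward the table plane might look as follows; the normal-based selection of back vertices, the neck/hip bounds, and the blending weight are illustrative assumptions, not the exact procedure used.

```python
# Simplified sketch: push back-facing trunk vertices of a lying-down mesh
# toward the table plane z = 0 to mimic the effect of gravity on soft tissue.
import numpy as np

def flatten_back(vertices, vertex_normals, neck_x, hip_x, weight=0.8):
    """vertices, vertex_normals: (N, 3) arrays in table coordinates with z up
    and x running from head to feet; the table surface is the plane z = 0."""
    v = vertices.copy()
    on_back = vertex_normals[:, 2] < -0.5                # normals pointing toward the table
    in_trunk = (v[:, 0] > neck_x) & (v[:, 0] < hip_x)    # neck-to-hip region only
    sel = on_back & in_trunk
    v[sel, 2] *= (1.0 - weight)                          # pull selected vertices toward z = 0
    return v
```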

3 Evaluation

To validate the performance of our approach, we collected data using a Microsoft Kinect 2 sensor mounted either on the ceiling above a CT scanner or on top of the CT gantry at 3 different hospital sites in North America and Europe. The sensors were calibrated w.r.t. the CT scanner using standard calibration techniques for cameras with color and depth sensors. For our experiments, we collected data from 1063 human subjects with different ages, body shapes/sizes, clothing, and ethnicities; 950 were real patients from three hospitals and 113 were volunteers from two other sites. For each patient, we collected images at up to three different table heights together with the corresponding CT data. For each volunteer, we collected 3 to 40 images covering the same subject at different table positions and in a variety of body poses, with and without clothing cover. Since we do not have CT scans for the volunteers, their data are only used to train and evaluate pose and landmark detection. In total, we collected 9872 raw range images from all subjects. All evaluation is performed on a desktop workstation with an Intel Xeon CPU E5-2650 v3, 128 GB RAM, and an nVidia GTX Titan X. Estimating the pose and 15 landmarks takes 110 ms on average, and the surface-matching optimization takes 750 ms on average.

Fig. 3. Two types of data generated for training DARWIN: (a) through gravity simulation and (b) through CT data with shape completion.

The deep convolutional network is used to obtain the pose and the initialization for our final fitting. Our first experiment evaluates pose and landmark accuracy. In this experiment, \(75\%\) of the subjects were used for training and \(25\%\) for testing. For comparison, we included the PBT-based technology presented in [15]. The pose classification network yields an accuracy of \(99.63\%\) on the testing data compared to \(99.48\%\) from PBT. To evaluate body surface markers, we report errors as the Euclidean distance between the ground truth and estimated locations. Figure 4(a) compares the median, mean, and \(95^{th}\) percentile errors obtained using our model and PBT [15]. The proposed method significantly outperforms PBT, and the Euclidean errors of our detected landmarks are also notably smaller than the numbers reported in [2], though on different datasets. Our next experiment compares landmark performance on patients with and without covers. Among all test images, \(21\%\) show covered patients and \(79\%\) show patients without covers. Between covered and uncovered patients, the difference in mean Euclidean distance error is less than 1 cm for each wrist landmark and less than 0.4 cm for all other landmarks. This demonstrates that our network localizes the landmarks well even when subjects are covered. Figure 4(b) shows detected landmarks on subjects with and without covers.
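For reference, the reported landmark statistics can be computed from per-landmark Euclidean distances as in the short sketch below; the array shapes are assumed for illustration.

```python
# Sketch of the error statistics reported in Fig. 4(a): per-landmark Euclidean
# distances, reduced to median, mean, and 95th percentile.
import numpy as np

def landmark_error_stats(pred, gt):
    """pred, gt: (N, 15, 3) arrays of detected and ground-truth 3D landmarks."""
    errors = np.linalg.norm(pred - gt, axis=-1)          # (N, 15) Euclidean distances
    return {"median": np.median(errors),
            "mean": errors.mean(),
            "p95": np.percentile(errors, 95)}
```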

Fig. 4. (a) Comparison of the median, mean, and \(95^{th}\) percentile Euclidean distance errors. (b) Two pairs of examples from the testing dataset comparing landmark detection on covered and uncovered patients. Green dots are ground truth and red dots are detection results.

For evaluating the accuracy of the estimated patient mesh, we compare it with the CT skin surface mesh on a dataset of 291 patients (unseen during the entire training process). Since the CT scans only cover a part of the patient body, the comparison for each scan is limited to the field of view of the CT scan. Our evaluation dataset covers different body ranges (\(7\%\) head, \(40\%\) chest/thorax, \(43\%\) abdomen, \(10\%\) rest including extremities), which ensures the evaluation is not biased toward certain body regions. We measured the Hausdorff and mean surface distances between the ground truth skin mesh (CT) and the estimated mesh. The overall Hausdorff distance was 54 mm and the mean surface distance was 17 mm. Figure 5(a) shows several result overlays on the patient data. Notice that even under the clothing cover, indicated by the depth surface profile in yellow, the estimated patient mesh is close to the CT skin surface. We also evaluated the SCAPE model without the shape training on CT and "gravity"-simulated meshes. The CT-trained shape model reduces the error by \(20\%\) in the abdomen area, which exhibits the largest shape deformation between standing and lying down. Figure 5(b) shows the silhouette of the DARWIN-fitted mesh with CT data on a subject with blanket cover.
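The two surface-distance metrics can be approximated by nearest-neighbour distances between point samples of the estimated mesh and the CT skin mesh, as in the sketch below; the authors' exact implementation is not specified, so this is only one common way to compute them.

```python
# Sketch of symmetric Hausdorff and mean surface distance between two point
# sets sampled from the estimated mesh and the CT skin mesh.
from scipy.spatial import cKDTree

def surface_distances(pred_points, gt_points):
    d_pred_to_gt, _ = cKDTree(gt_points).query(pred_points)
    d_gt_to_pred, _ = cKDTree(pred_points).query(gt_points)
    hausdorff = max(d_pred_to_gt.max(), d_gt_to_pred.max())        # symmetric Hausdorff
    mean_surface = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    return hausdorff, mean_surface
```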

Fig. 5. (a) Overlay of the estimated patient mesh (red), depth surface (yellow), and CT skin surface (green) on an orthographically projected lateral CT view; (b) DARWIN fitting to a subject with blanket cover.

4 Conclusion

In this paper, we presented DARWIN, which models the 3D geometric surface of a patient driven by a deep image network. Specifically, DARWIN addresses the challenges of modeling the shape of a lying-down person under loose covers. To do so, DARWIN is trained on a large amount of real clinical patient data with pairs of depth sensor and CT images. Promising results demonstrate that DARWIN can provide accurate and robust estimates of patient pose and geometry for clinical applications such as more efficient scanning workflows, patient motion detection, collision avoidance, and pose verification. Our future work includes speeding up the computation to enable real-time mesh generation as well as handling arbitrary patient poses to enable further benefits for various clinical applications.