Abstract
Purpose
Articulated hand pose tracking is an under-explored problem with potential for use in a wide range of applications, especially in the medical domain. With a robust and accurate tracking system for surgical videos, the motion dynamics and movement patterns of the hands can be captured and analyzed for many downstream tasks.
Methods
In this work, we propose a novel hand pose estimation model, CondPose, which improves detection and tracking accuracy by incorporating a pose prior into its predictions. We show improvements over state-of-the-art methods, which make frame-wise independent predictions, by following a temporally guided approach that effectively leverages past predictions.
Results
We collect Surgical Hands, the first dataset that provides multi-instance articulated hand pose annotations for videos. It contains over 8.1k annotated hand poses from publicly available surgical videos, along with bounding boxes, pose annotations, and tracking IDs to enable multi-instance tracking. When evaluated on Surgical Hands, our method outperforms the state-of-the-art approach in mean Average Precision (mAP), which measures pose estimation accuracy, and Multiple Object Tracking Accuracy (MOTA), which assesses pose tracking performance.
Conclusion
Compared to a frame-wise independent strategy, our approach detects and tracks hand poses more accurately, with the most substantial gains in localization accuracy. This has positive implications for generating more accurate representations of the hands in the scene for targeted downstream tasks.
Introduction
Machine learning and computer vision have become increasingly integrated with healthcare, as is apparent in a myriad of tasks such as tumor segmentation [1], technical skill assessment [2,3,4,5,6], and tool detection and tracking [7,8,9,10]. Here we study the problem of articulated hand pose tracking in the surgical domain. Tracking hand poses can facilitate other useful tasks, such as technical skill assessment, temporal action recognition, and training surgical residents. Pose tracking in the computer vision community is primarily centered on human poses [11,12,13,14,15,16,17,18,19], while medical works focus on detecting and tracking surgical instruments [7,8,9,10]. Tracking surgical instruments is useful, but these instruments are specific to the procedures seen during training. Instead, we shift the emphasis away from surgical instruments to articulated hand tracking, which applies to a broader set of surgical tasks. Articulated hand pose tracking can highlight important properties such as grip, motion, and tension that human experts often attend to when evaluating videos.
A challenge in pose tracking is the temporal consistency of predictions between frames, the lack of which leads to flickering and improbable changes in estimated poses. Existing works in articulated pose tracking [11, 14, 17,18,19] make frame-wise independent predictions and rely on post-processing during tracking [12, 13, 15, 16] to gather temporal context. However, they do not integrate past inferences when localizing joints. We address this by proposing CondPose, a new model that makes predictions conditioned on the pose estimates from prior frames. In Fig. 1, we compare both approaches: the baseline using frame-wise independent predictions and our model using conditional predictions. The initial estimate may fluctuate due to varying factors, such as lighting, hand orientation, or motion blur, but we find that using prior predictions as guidance improves localization accuracy. The internal representation of a hand's state (position, appearance, and classification) becomes a function of its current observation and its previous states. By learning this Markovian prior for hand joint prediction, we improve pose estimation and, consequently, tracking accuracy.
There is a lack of data and benchmarks for articulated hand pose tracking. To address this, we collect a novel dataset featuring intra-operative videos of real surgeries, Surgical Hands. We annotate the articulated hand poses of surgeons which subsumes both surgical instrument and non-instrument actions, e.g., suturing, knot-tying, and gesturing. We are, to the best of our knowledge, the first to introduce a labeled dataset for both detection and tracking of multiple articulated hand poses. We benchmark our dataset against existing tracking baselines and demonstrate the superiority of our proposed approach on both hand pose estimation and tracking.
Our contributions are as follows:
-
We introduce CondPose, a novel deep network that takes advantage of confident prior predictions to improve localization accuracy and tracking consistency.
-
We present Surgical Hands,\(^{1}\) a new video dataset for multi-instance articulated hand pose estimation and tracking in the surgical domain.
-
We set new state-of-the-art benchmark performance on Surgical Hands.
Related works
Articulated pose estimation and tracking
Surgical instruments
Data-driven methods in the medical video domain primarily involve RAS videos. Works in this space [3,4,5] traditionally use kinematic data directly, requiring an external apparatus to capture these measurements. However, full kinematic information is available only for robot-controlled tools and rarely for hand-held instruments, and adding an external apparatus to capture kinematic data can negatively impact the cost, flexibility, and performance of certain operations. For detection, pure computer vision-based approaches extract information directly from video data. Many vision works use a region proposal network to perform localization [7, 20, 21], segmentation [9, 22], and articulated pose estimation [8, 23] from images.
To incorporate tracking, existing works may use a similarity function based on weighted mutual information [24] or Bayesian filtering as part of a minimization problem [25]. Nwoye et al. [10] are the first to measure Multiple Object Tracking Accuracy (MOTA) [26] for surgical instruments in this setting, using a weakly supervised approach with coarse binary labels indicating the presence or absence of seven surgical instruments. However, their evaluation contains at most one instance of each tool type per frame; hence, it can be narrowed down to an object detection problem. Unlike their work, we track multiple instances of the same object in each frame. We also use MOTA as part of our benchmark when tracking hands in our videos.
Human pose
Pose estimation and tracking is commonly applied to images and videos of people, with methods grouped into top-down [12,13,14,15,16] and bottom-up [17,18,19] strategies. Top-down methods detect all persons in an image, then regress each human pose independently using a pose estimation network. Bottom-up methods detect all joints in an image and use bipartite matching and graph minimization techniques to assign joints to each person. As top-down approaches typically perform best in practice, we follow this paradigm. For tracking, [12] uses greedy matching based on IoU (intersection-over-union) overlap and optical flow to propagate bounding boxes between frames, [13] uses deformable convolutions to warp predictions between frames, and [15] introduces a Graph Convolutional Network (GCN) [27] to match learned embeddings between human poses. A GCN is a neural network that operates on a set of nodes and edges, performing convolution operations over the relations between nodes. The inherent structure of this graph can improve the quality of learned features and abstract away the limitations of 2D space. These approaches spatially shift pose predictions, which cannot overcome certain factors (e.g., missed detections). In contrast, we address this problem at the detection step by integrating past pose observation(s) into each new prediction.
Hand pose
Current works on 2D hand pose estimation [28,29,30] are analogous to human pose estimation. Zhang et al. [31] perform pose tracking, using a disparity map from stereo camera inputs to estimate a 3D hand pose. However, their data consist of only a single subject's hand and at most one detection per frame. There are many image datasets [28, 30,31,32] for hand pose estimation, with a combination of manual, synthetic, and predicted annotations, but none satisfy the conditions of multiple object instances and tracking from video, much less in a surgical setting. Therefore, we introduce the Surgical Hands dataset for multi-instance articulated hand pose tracking. Our dataset includes varying lighting conditions, fast movement, and diversity in scene appearance. Distinctively, we also include gloved hands, which appear in contrasting colors, such as white latex and green.
Method
We propose CondPose to perform articulated pose detection and tracking by incorporating previous observations as prior guidance. We show our model in Fig. 2. While the baseline produces a heatmap for each hand using a pose estimation network, we leverage past predictions to produce conditioned hand pose outputs, improving detection performance during inference. Although we design CondPose with video data in mind, we begin by pretraining on image data, then finetune on our video dataset, Surgical Hands, and lastly compare different tracking methods.
Hand pose estimation in images
We first pretrain on image data, defining the input and output of the pose estimation network, P, as \(\hat{\mathcal {H}} = P(\mathcal {I})\). The input is an image crop \(\mathcal {I} \in \mathbb {R}^{H \times W \times 3}\), and the output is a predicted heatmap \(\hat{\mathcal {H}} \in \mathbb {R}^{H' \times W' \times J}\). Here H and W are the input image height and width, \(H'\) and \(W'\) are the output heatmap height and width, and J is the number of predicted joints per hand. Each image crop is scaled to 2.2 times the area of the hand bounding box. We train using the mean squared error (MSE) between the ground truth and predicted heatmaps, \(\mathcal {L} = \Vert (\mathcal {H} - \hat{\mathcal {H}}) \odot \mathcal {M} \Vert ^2\). The ground truth heatmaps, \(\mathcal {H}\), are generated from 2D Gaussians centered on each annotated keypoint, and the mask \(\mathcal {M}\) excludes un-annotated joints from the loss. The output joint locations are the maximum-value positions along the third (joint) dimension of \(\hat{\mathcal {H}}\). After pretraining, we finetune our model on videos to learn conditional hand pose predictions.
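To make the training objective concrete, the following PyTorch-style sketch renders the Gaussian target heatmaps, applies the masked MSE loss \(\mathcal {L}\), and decodes joint locations from the per-joint maxima. This is a minimal illustration rather than the authors' implementation; the tensor shapes, Gaussian width, and helper names are assumptions.

```python
import torch

def gaussian_heatmaps(keypoints, mask, h, w, sigma=2.0):
    """Render one 2D Gaussian per joint; un-annotated joints get an all-zero map.
    keypoints: (J, 2) x-y coordinates in heatmap space; mask: (J,) 1 if annotated."""
    ys = torch.arange(h).view(h, 1).float()
    xs = torch.arange(w).view(1, w).float()
    maps = []
    for (x, y), m in zip(keypoints, mask):
        g = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        maps.append(g * m)
    return torch.stack(maps)  # (J, H', W')

def masked_mse(pred, target, mask):
    """MSE between heatmaps with un-annotated joints masked out.
    pred, target: (B, J, H', W'); mask: (B, J)."""
    diff = (target - pred) * mask[..., None, None]
    return (diff ** 2).mean()

def decode_joints(pred):
    """Maximum-value position in each joint channel -> (B, J, 2) x-y coordinates."""
    b, j, h, w = pred.shape
    flat = pred.view(b, j, -1).argmax(dim=-1)
    return torch.stack((flat % w, flat // w), dim=-1)
```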
Hand pose estimation in videos
While image data cannot be used to learn our conditional hand pose predictions, pretraining lets us initialize weights to speed up training and improve generalizability. We finetune CondPose on Surgical Hands, shown in the top portion of Fig. 2. To incorporate a prior branch, we introduce a heatmap prior, \(\hat{\mathcal {H}}_{t-\delta }\), a pose estimate of the same object from time \(t-\delta \). Our model performs conditional predictions, defined as \(\hat{\mathcal {H}}_t = P(\mathcal {I}_t \mid \hat{\mathcal {H}}_{t-\delta })\).
In contrast to our previous definition of P, \(\hat{\mathcal {H}}_t\) is now conditioned on predictions at a previous time step \(t - \delta \). Our model is further composed of two branches: the attention mechanism, \(M_\mathrm{att}\), and the fusing module, \(M_\mathrm{fus}\). \(M_\mathrm{att}\) contextualizes the prior heatmap prediction, \(\hat{\mathcal {H}}_{t-\delta }\), with image features, \(v_t\) (\(conv\_1\) in our experiments), at time t. This branch relates the visual representation and the localized heatmap prior, ideally learning to weight each joint prior accordingly. \(M_\mathrm{fus}\) produces a merged final heatmap from the initial prediction, \(\hat{\mathcal {H}}'_t\), and the weighted heatmap prior, \(\hat{\mathcal {H}}'_{t-\delta }\). \(M_\mathrm{att}\) and \(M_\mathrm{fus}\) are both composed of two convolutional layers followed by a transposed convolution, with ReLU nonlinearities in between.
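A minimal PyTorch sketch of the conditioning head described above, assuming both branches are two convolutions followed by a transposed convolution with ReLUs in between. The channel sizes, the stride-1 transposed convolution (kept so the sketch is shape-consistent), and the assumption that \(v_t\), \(\hat{\mathcal {H}}_{t-\delta }\), and \(\hat{\mathcal {H}}'_t\) share a spatial resolution are ours, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CondHead(nn.Module):
    """Sketch of M_att and M_fus: attention over the prior heatmap given early
    image features, then fusion with the initial frame-t estimate."""
    def __init__(self, feat_ch=64, joints=21):
        super().__init__()
        def branch(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                # stride-1 transposed conv keeps the sketch shape-consistent;
                # the real head may upsample here
                nn.ConvTranspose2d(64, out_ch, 3, stride=1, padding=1))
        self.m_att = branch(feat_ch + joints, joints)  # weights the prior per joint
        self.m_fus = branch(2 * joints, joints)        # merges prior and new estimate

    def forward(self, feats_t, heat_prior, heat_init):
        # feats_t ~ v_t, heat_prior ~ H_{t-delta}, heat_init ~ H'_t
        weighted_prior = self.m_att(torch.cat([feats_t, heat_prior], dim=1))
        return self.m_fus(torch.cat([heat_init, weighted_prior], dim=1))

# Example shapes: 64-channel features and 21-joint heatmaps at 64x64
head = CondHead()
out = head(torch.randn(1, 64, 64, 64), torch.randn(1, 21, 64, 64),
           torch.randn(1, 21, 64, 64))  # -> (1, 21, 64, 64)
```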
During training, the prior is selected from frame \(t - \delta \). If the object does not exist at that frame, we use earlier frames up until its first occurrence. If a corresponding object does not exist in any previous frame, the prior, \(\hat{\mathcal {H}}_{t-\delta }\), is set to an all-zeros heatmap. This is expected behavior during evaluation, because priors do not yet exist at the first frame. Also, during evaluation, unlike training, the prior associated with the current detection is unknown. Given n priors from time \(t-1\), \(\{ \hat{\mathcal {H}}_{t-1}^{1}, \hat{\mathcal {H}}_{t-1}^{2}, \ldots , \hat{\mathcal {H}}_{t-1}^{n}\}\), and k detections at time t, \(\{ \hat{\mathcal {I}}_{t}^{1}, \hat{\mathcal {I}}_{t}^{2}, \ldots , \hat{\mathcal {I}}_{t}^{k}\}\), we pass all pairs through the network to generate candidates. The heatmap with the highest average confidence score is selected as the output for that detection.
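At evaluation time, the pairing of detections with priors can be resolved as sketched below: each detection is conditioned on every prior from \(t-1\) (plus an all-zeros prior), and the most confident candidate is kept. Reading "average confidence" as the mean of the per-joint peak values is our assumption, as is the `model(crop, prior)` interface.

```python
def pick_conditioned_heatmap(model, crop_t, priors, zero_prior):
    """For one detection at time t, try every prior heatmap from t-1 plus an
    all-zeros prior (covering new hands), and keep the most confident output.
    model(crop, prior) is assumed to return a (J, H', W') heatmap tensor."""
    best, best_score = None, float("-inf")
    for prior in list(priors) + [zero_prior]:
        heat = model(crop_t, prior)                    # conditioned prediction
        score = heat.amax(dim=(-2, -1)).mean().item()  # mean per-joint peak confidence
        if score > best_score:
            best, best_score = heat, score
    return best
```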
Matching strategies for tracking
Following the detect-then-track paradigm, we require a matching strategy to perform tracking. Given n hands at time \(t-1\) and m hands at time t, we use a similarity function to derive similarity measures between each pair at \(t-1\) and t. Common choices are the intersection-over-union (IoU) of bounding boxes, the average L2-distance between predicted joint locations, or the L2-distance between graph pose embeddings. Similar to Ning et al. [15], we train a GCN to output an embedding of each input hand pose, \(\mathcal {X}\), defined simply as \(\hat{p} = GCN(\mathcal {X})\). Here \(\mathcal {X} \in \mathbb {R}^{J \times C}\), where J is the number of joints and C is the number of channels. For training, we use the contrastive loss [33], \(\mathcal {L} = \frac{1}{2} \left( y \, d^2 + (1 - y) \max \left( 0, m - d \right)^2 \right) \), which places embeddings of the same hand close in perceptual distance. For a pair of embeddings \(\hat{p}_v^1\) and \(\hat{p}_v^2\), d is the L2-distance between the two, \(d = \Vert \hat{p}_v^1 - \hat{p}_v^2\Vert _2\); y is a binary label indicating the same hand (1) or different hands (0); and m is a margin hyperparameter. For each item in our minibatch, positive pairs are selected between adjacent frames with probability \(p=0.5\), and negative pairs are selected from the same video with \(p=0.4\) or from a different video with \(p=0.1\). We evaluate our trained GCN models using the classification accuracy between pairs of selected hands, achieving accuracies of \(>97\%\).
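As a reference for the objective above, here is a small PyTorch sketch of the contrastive loss over pose-graph embeddings. The batched form and the mean reduction are assumptions; the 128-dimensional embeddings follow the text, and the margin value is a placeholder.

```python
import torch

def contrastive_loss(p1, p2, y, margin=1.0):
    """Contrastive loss over embedding pairs.
    p1, p2: (B, 128) pose embeddings; y: (B,) with 1 = same hand, 0 = different."""
    d = torch.norm(p1 - p2, dim=1)                         # L2 distance per pair
    pos = y * d.pow(2)                                     # pull matching pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push non-matches past the margin
    return 0.5 * (pos + neg).mean()
```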
Dataset
We lack data for training and benchmarking models on multi-instance hand tracking. Therefore, we introduce Surgical Hands, a novel video dataset for multi-instance articulated hand pose estimation and tracking in the surgical domain, the first of its kind. From publicly available data, we collect 28 videos with a view of the hands of surgical team members during the operation. From those videos, we extract 76 clips sampled at 8 frames per second and collect bounding box, class label, tracking ID, and pose annotations using Amazon Mechanical Turk (AMT) and a modified version of the Visipedia Annotation Tools. We show samples of our annotations in Fig. 3. Each hand is labeled with its handedness (left/right), 21 joints, and a property for each joint: visible, occluded, or not available. Visible means the joint is visibly on screen, occluded means the joint is obstructed but its position can be estimated, and not available means the joint position cannot be inferred or is off-screen. In total, we have 2,838 annotated frames and 8,178 unique hand annotations from 21 unique annotators. Each annotated frame contains a mean of 2.88 hands, a median of 3 hands, and a maximum of 7 hands.
Experiments and evaluation
Implementation details
We adopt a ResNet-152 pose estimation model [12] and first train it on hand pose image data, CMU Manual Hands and Synthetic Hands [28]. We use a batch size of 16, training for 30 epochs with the Adam optimizer and a learning rate of \(10^{-3}\). When finetuning on Surgical Hands, we use leave-one-out cross-validation and split our data into 28 folds. Clips belonging to the same video are placed in the same validation fold, and the reported metrics are averaged across all folds. We employ a variant of curriculum learning that gradually transitions from ground truth priors to predicted priors: a predicted prior at \(t-\delta \) is sampled with probability \(p = 0.10 \times epoch\), so that only predictions are used from epoch 10 onward. We empirically select \(\delta =3\) during training. For all training, we apply random rotations and horizontal flipping as data augmentation. When training the GCN for tracking, we use a batch size of 32, train for 60 epochs, and use an initial learning rate of \(10^{-3}\). We normalize the keypoint coordinates in \(\mathcal {X}\) to [0, 1] relative to the bounding box. The input dimension of each pose is \(J \times C\), where J is the number of joints and C is the number of channels; we use \(C=2\) for x-y coordinates and \(C=3\) to additionally include the annotation state (0 = unannotated, 1 = annotated, or a value in [0, 1] for predicted keypoints). We adopt a two-layer Spatio-Temporal GCN [15, 34] to output a 128-dimensional embedding of each pose.
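The curriculum over priors can be implemented as a simple per-sample draw, as in the sketch below; the linear 0.10-per-epoch schedule follows the text, while the function name and interface are assumptions.

```python
import random

def sample_prior(epoch, gt_prior, pred_prior, rate=0.10):
    """Use the model's own prediction from t-delta with probability
    min(1, rate * epoch); otherwise use the ground truth prior.
    From epoch 10 onward only predicted priors are sampled."""
    p_pred = min(1.0, rate * epoch)
    return pred_prior if random.random() < p_pred else gt_prior
```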
Detection performance
We evaluate detection performance on Surgical Hands using mean Average Precision (mAP), the metric of choice in human pose evaluation. mAP is computed using the Probability of Correct Keypoints (PCK), which measures the probability of correctly localizing keypoints within a normalized threshold distance, \(\sigma \). The threshold \(\sigma =0.2\) is empirically chosen to be roughly the ratio between the length of a thumb joint and the enclosing bounding box. Pose predictions are matched to ground truth poses based on the highest PCK, and unassigned predictions are counted as false positives. AP is computed for each joint, and mAP is reported across the entire dataset. In Table 1 we report the mAP at the highest MOTA score (defined in the next section) for each model. With our recursive heatmap strategy, we obtain higher average precision across the different joints of the hand. In Fig. 4 we show qualitative examples of our hand pose estimation on various frames from Surgical Hands. The top row is sampled from the best performing clips, while the bottom row is from the worst performing clips. The model suffers most in cases of heavy occlusion, where the camera view excludes the majority of the hand. Ambiguity in the position of the hand, e.g., a top-down view with most fingers occluded, further increases localization error. The best performing cases are those with balanced lighting and an unambiguous view of the first few digits.
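A per-hand PCK computation might look like the sketch below; the exact normalization length derived from the bounding box is an assumption consistent with the description above.

```python
import numpy as np

def pck(pred, gt, annotated, norm, sigma=0.2):
    """Fraction of annotated joints predicted within sigma * norm of ground truth.
    pred, gt: (J, 2) keypoints; annotated: (J,) boolean; norm: bounding-box-derived length."""
    dist = np.linalg.norm(pred - gt, axis=1)       # per-joint error
    correct = (dist <= sigma * norm) & annotated   # only annotated joints count
    return correct.sum() / max(int(annotated.sum()), 1)
```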
Tracking performance
To measure tracking performance, we use Multiple Object Tracking Accuracy (MOTA), which also takes into account the consistency of localized keypoints between frames. MOTA [26] is defined as
\(\mathrm{MOTA} = \left( 1 - \frac{\sum _t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right) }{\sum _t G_t} \right) \times 100.\)
This encapsulates the errors that may occur during multiple object tracking: false negatives (FN), false positives (FP), and identity switches (IDSW). FNs are joints for which no hypothesis/prediction is given, FPs are hypotheses for which no real joint exists, and IDSWs are occurrences where the tracking IDs of two joints are swapped. G is the total number of ground truth joints. MOTA ranges over \((-\infty , 100]\).
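Numerically, the metric is a straightforward aggregation of the three error types; the sketch below assumes the error counts have already been accumulated over all frames and joints.

```python
def mota(false_neg, false_pos, id_switches, num_gt_joints):
    """MOTA as a percentage: 100 * (1 - (FN + FP + IDSW) / G).
    Unbounded below; equals 100 only when no errors occur."""
    return 100.0 * (1.0 - (false_neg + false_pos + id_switches) / float(num_gt_joints))
```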
We perform tracking using three methods: IoU, L2-distance, and GCN. Intersection-over-union (IoU) measures the overlap of two bounding boxes as the ratio of the area of intersection to the area of union, between subsequent frames in our case. L2-distance measures the average L2 distance of regressed keypoints between frames. GCN measures the embedding similarity between encoded keypoints to determine matches. We show quantitative results from our experiments in Table 3 and the per-joint performance in Table 2. Each row is maximized for the highest MOTA score across all hyperparameters, shown along with its corresponding mAP. Our method has a higher MOTA score across all of the videos, and our corresponding mAP scores are greater by a much larger margin. This points to the advantage of temporally leveraging predictions from previous frames during the detection step. In the frame-by-frame comparison between the baseline and our method in Fig. 5, we note a higher recall and improved localization: while the last digit is obstructed, its position can still be reasonably inferred. In the last two columns of Table 3 we use an object detector to detect hands, whereas the prior two columns (perfect detections) use the manual annotations. With an object detector trained on 100 Days of Hands (100DOH) [35], we see lower localization and tracking accuracy but a trend consistent with the baseline comparison. The quality of the detections serves as a bottleneck, but the margins of improvement are very similar. Although the model is trained with perfect detections as priors, perfect detections are not required to maintain performance in practice.
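Whichever similarity is used (IoU, L2-distance, or GCN embeddings), tracking reduces to assigning current detections to previous tracks from a similarity matrix. The sketch below uses Hungarian matching for concreteness; the paper's exact assignment scheme (e.g., greedy matching) may differ, and the threshold is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hands(sim, min_sim=0.1):
    """Assign hands at frame t (columns) to tracks from t-1 (rows) given a
    similarity matrix sim of shape (n_prev, n_curr). Unmatched columns start new tracks."""
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]
    matched_curr = {c for _, c in matches}
    new_tracks = [c for c in range(sim.shape[1]) if c not in matched_curr]
    return matches, new_tracks

# Example: two previous tracks, three current detections
sim = np.array([[0.8, 0.1, 0.0],
                [0.0, 0.6, 0.2]])
print(match_hands(sim))  # -> ([(0, 0), (1, 1)], [2])
```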
Ablation analysis
We perform an ablative analysis on the convolutional feature map in \(M_\mathrm{att}\) and the fusing module \(M_\mathrm{fus}\). We experiment with no prior convolutional feature map (NC), no attention mechanism (NA), and the removal of both (NC-NA), showing our results in Table 4. Our full model has the highest scores overall. The attention mechanism and the convolutional feature maps have opposing effects on the mAP and MOTA scores. The NC model does not use a convolutional feature map from frame t, so the fusing module is applied directly to the unaltered heatmaps from \(t-\delta \) and t. We find this increases the mAP but lowers the MOTA score. The NA model directly concatenates the convolutional features and the heatmaps, with no attention mechanism. This has the opposite effect, decreasing the mAP significantly while slightly increasing the overall MOTA score. Without contextual convolutional features (NC and NC-NA), the model can still learn to use the prior prediction and improve its detection score. In contrast, removing the attention mechanism brings a drop in mAP, which may be attributed to an unrefined prior with noisy features. The small increase in MOTA is likely from fewer false positives produced by that model, a consequence of its slightly lower mAP.
We also explore the value of our hyperparameter, \(\delta \), during training. We use values \(\delta =\{1,2,3,4\}\) and show our results in Table 5. Optimizing for highest MOTA score, we found \(\delta =3\) to be best with 39.31, followed by \(\delta =1\) with a smaller MOTA score (39.03) but a higher mAP (58.64 vs 56.66). We find a nonlinear correlation between the mAP and MOTA scores, showing a trade-off in mAP when optimizing for the tracking performance. The best strategy is one that maximizes MOTA accuracy with minimal loss in localization precision.
Evaluation on human pose
We ran additional experiments on the PoseTrack18 dataset, comparing our model with our re-implementation of the baseline. Fig. 6 shows a narrower performance gap, but our findings are consistent with the earlier experiments: our model maintains a higher mAP score at the highest MOTA values. Given the trade-off between mAP and MOTA, this means our model is more likely to retain its localization precision at higher tracking accuracies.
Conclusion
In this work, we introduce Surgical Hands, the first articulated multi-hand pose tracking dataset of its kind. Additionally, we introduce CondPose, a novel network that makes conditional hand pose predictions by incorporating past observations as priors. We show that, compared with a frame-wise independent strategy, our approach performs better at localizing and tracking hand poses, and in particular achieves higher localization accuracy at comparable tracking performance. While tracking drives the consistency of joints through time, the actual shape and characteristics of the hand are described by the localization precision. With higher localization precision and better tracking, we obtain a better representation of the hands in the scene. While not the focus of this work, a reliable hand tracking method can provide a salient signal for approximating surgical skill or understanding actions.
Notes
1. Both the code and dataset are available at https://github.com/MichiganCOG/Surgical_Hands_RELEASE.
References
Malathi M, Sinthia P (2019) Brain tumour segmentation using convolutional neural network with tensor flow. Asian Pac J Cancer Prev: APJCP 20(7):2095
Dias RD, Gupta A, Yule SJ (2019) Using machine learning to assess physician competence: a systematic review. Acad Med 94(3):427–439
Tao L, Elhamifar E, Khudanpur S, Hager GD, Vidal R (2012) Sparse hidden markov models for surgical gesture classification and skill evaluation. In: international conference on information processing in computer-assisted interventions. Springer, pp 167–177
Zappella L, Béjar B, Hager G, Vidal R (2013) Surgical gesture classification from video and kinematic data. Med Image Anal 17(7):732–745
Forestier G, Petitjean F, Senin P, Despinoy F, Huaulmé A, Fawaz HI, Weber J, Idoumghar L, Muller P-A, Jannin P (2018) Surgical motion analysis using discriminative interpretable patterns. Artif Intell Med 91:3–11
Kumar S, Ahmidi N, Hager G, Singhal P, Corso J, Krovi V (2015) Surgical performance assessment. Mech Eng 137(09):7–10
Sarikaya D, Corso JJ, Guru KA (2017) Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE TMI 36(7):1542–1549
Colleoni E, Moccia S, Du X, De Momi E, Stoyanov D (2019) Deep learning based robotic tool detection and articulation estimation with spatio-temporal layers. IEEE Robot Autom Lett 4(3):2714–2721
Ni Z-L, Bian G-B, Xie X-L, Hou Z-G, Zhou X-H, Zhou Y-J (2019) Rasnet: segmentation for tracking surgical instruments in surgical videos using refined attention segmentation network. In: 2019 41st annual international conference of the IEEE engineering in medicine and biology society (EMBC). IEEE, pp 5735–5738
Nwoye CI, Mutter D, Marescaux J, Padoy N (2019) Weakly supervised convolutional lstm approach for tool tracking in laparoscopic videos. IJCARS 14(6):1059–1067
Andriluka M, Iqbal U, Insafutdinov E, Pishchulin L, Milan A, Gall J, Schiele B (2018) Posetrack: a benchmark for human pose estimation and tracking. In: IEEE CVPR, pp 5167–5176
Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: ECCV, pp 466–481
Bertasius G, Feichtenhofer C, Tran D, Shi J, Torresani L (2019) Learning temporal pose estimation from sparsely-labeled videos. In: NeurIPS, pp 3027–3038
Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: IEEE CVPR, pp 5693–5703
Ning G, Pei J, Huang H (2020) Lighttrack: a generic framework for online top-down human pose tracking. In: IEEE CVPR workshops, pp 1034–1035
Wang M, Tighe J, Modolo D (2020) Combining detection and tracking for human pose estimation in videos. In: IEEE CVPR, pp 11088–11096
Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: IEEE CVPR, pp 7291–7299
Raaj Y, Idrees H, Hidalgo G, Sheikh Y (2019) Efficient online multi-person 2d pose tracking with recurrent spatio-temporal affinity fields. In: IEEE CVPR, pp 4620–4628
Jin S, Liu W, Ouyang W, Qian C (2019) Multi-person articulated tracking with spatial and temporal embeddings. In: IEEE CVPR, pp 5664–5673
Khalid S, Goldenberg M, Grantcharov T, Taati B, Rudzicz F (2020) Evaluation of deep learning models for identifying surgical actions and measuring performance. JAMA Netw Open 3(3):201664–201664
Jin A, Yeung S, Jopling J, Krause J, Azagury D, Milstein A, Fei-Fei L (2018) Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In: 2018 IEEE WACV, IEEE, pp 691–699
Laina I, Rieke N, Rupprecht C, Vizcaíno JP, Eslami A, Tombari F, Navab N (2017) Concurrent segmentation and localization for tracking of surgical instruments. In: MICCAI. Springer, pp 664–672
Du X, Kurmann T, Chang P-L, Allan M, Ourselin S, Sznitman R, Kelly JD, Stoyanov D (2018) Articulated multi-instrument 2-d pose estimation using fully convolutional networks. IEEE TMI 37(5):1276–1287
Richa R, Balicki M, Meisner E, Sznitman R, Taylor R, Hager G (2011) Visual tracking of surgical tools for proximity detection in retinal surgery. In: international conference on information processing in computer-assisted interventions. Springer, pp 55–66
Sznitman R, Richa R, Taylor RH, Jedynak B, Hager GD (2012) Unified detection and tracking of instruments during retinal microsurgery. IEEE PAMI 35(5):1263–1273
Bernardin K, Stiefelhagen R (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP J Image Video Process 2008:1–10
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: international conference on learning representations
Simon T, Joo H, Matthews I, Sheikh Y (2017) Hand keypoint detection in single images using multiview bootstrapping. In: IEEE CVPR, pp 1145–1153
Santavas N, Kansizoglou I, Bampis L, Karakasis E, Gasteratos A (2020) Attention! a lightweight 2d hand pose estimation approach. IEEE Sens J 21(10):11488–11496
Zimmermann C, Ceylan D, Yang J, Russell B, Argus M, Brox T (2019) Freihand: a dataset for markerless capture of hand pose and shape from single rgb images. In: IEEE ICCV, pp 813–822
Zhang J, Jiao J, Chen M, Qu L, Xu X, Yang Q (2017) A hand pose tracking benchmark from stereo matching. In: 2017 IEEE international conference on image processing (ICIP). IEEE, pp 982–986
Gomez-Donoso F, Orts-Escolano S, Cazorla M (2019) Large-scale multiview 3d hand pose dataset. IVC 81:25–33
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: IEEE CVPR. IEEE, vol 2, pp 1735–1742
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: thirty-second AAAI conference on artificial intelligence
Shan D, Geng J, Shu M, Fouhey DF (2020) Understanding human hands in contact at internet scale. In: IEEE CVPR, pp 9869–9878
Funding
This project was supported by the National Heart, Lung, and Blood Institute (NHLBI: R01HL146619) and the University of Michigan (U-M's Mcubed Program). Opinions expressed in this manuscript do not represent those of the NIH, the US Department of Health and Human Services, or the US Department of Veterans Affairs.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This article does not contain patient data.