research-article

EgoCap: egocentric marker-less motion capture with two fisheye cameras

Authors:

Christian Richardt,

Eldar Insafutdinov,

Mohammad Shafiei,

Hans-Peter Seidel,

Christian TheobaltAuthors Info & Claims

ACM Transactions on Graphics (TOG), Volume 35, Issue 6

Article No.: 162, Pages 1 - 11

https://doi.org/10.1145/2980179.2980235

Published: 05 December 2016 Publication History

Abstract

Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center. They often create discomfort with marker suits, and their recording volume is severely restricted and often constrained to indoor scenes with controlled backgrounds. Alternative suit-based systems use several inertial measurement units or an exoskeleton to capture motion with an inside-in setup, i.e. without external sensors. This makes capture independent of a confined volume, but requires substantial, often constraining, and hard to set up body instrumentation. Therefore, we propose a new method for real-time, marker-less, and egocentric motion capture: estimating the full-body skeleton pose from a lightweight stereo pair of fisheye cameras attached to a helmet or virtual reality headset - an optical inside-in method, so to speak. This allows full-body motion capture in general indoor and outdoor scenes, including crowded scenes with many people nearby, which enables reconstruction in larger-scale activities. Our approach combines the strength of a new generative pose estimation framework for fisheye views with a ConvNet-based body-part detector trained on a large new dataset. It is particularly useful in virtual reality to freely roam and interact, while seeing the fully motion-captured virtual body.

Supplementary Material

ZIP File (a162-rhodin.zip)

Supplemental file.

Download
85.42 MB

References

[1]

Amin, S., Andriluka, M., Rohrbach, M., and Schiele, B. 2009. Multi-view pictorial structures for 3D human pose estimation. In BMVC.

[2]

Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B. 2014. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR.

Digital Library

[3]

Baak, A., Müller, M., Bharaj, G., Seidel, H.-P., and Theobalt, C. 2011. A data-driven approach for real-time full body pose reconstruction from a depth camera. In ICCV.

Digital Library

[4]

Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., and Ilic, S. 2014. 3D pictorial structures for multiple human pose estimation. In CVPR.

Digital Library

[5]

Blanz, V., and Vetter, T. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH.

Digital Library

[6]

Bregler, C., and Malik, J. 1998. Tracking people with twists and exponential maps. In CVPR.

Digital Library

[7]

Burenius, M., Sullivan, J., and Carlsson, S. 2013. 3D pictorial structures for multiple view articulated pose estimation. In CVPR.

Digital Library

[8]

Cerezo, E., Pérez, F., Pueyo, X., Seron, F. J., and Sillion, F. X. 2005. A survey on participating media rendering techniques. The Visual Computer 21, 5, 303--328.

Digital Library

[9]

Chai, J., and Hodgins, J. K. 2005. Performance animation from low-dimensional control signals. ACM Transactions on Graphics 24, 3, 686--696.

Digital Library

[10]

Chen, X., and Yuille, A. L. 2014. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS.

Digital Library

[11]

EgoCap, 2016. EgoCap dataset. http://gvv.mpi-inf.mpg.de/projects/EgoCap/.

[12]

Elhayek, A., de Aguiar, E., Jain, A., Tompson, J., Pishchulin, L., Andriluka, M., Bregler, C., Schiele, B., and Theobalt, C. 2015. Efficient ConvNet-based markerless motion capture in general scenes with a low number of cameras. In CVPR.

[13]

Fathi, A., Farhadi, A., and Rehg, J. M. 2011. Understanding egocentric activities. In ICCV.

Digital Library

[14]

Gall, J., Rosenhahn, B., Brox, T., and Seidel, H.-P. 2010. Optimization and filtering for human motion capture. International Journal of Computer Vision 87, 1--2, 75--92.

Digital Library

[15]

Ha, S., Bai, Y., and Liu, C. K. 2011. Human motion reconstruction from force sensors. In SCA.

Digital Library

[16]

He, K., Zhang, X., Ren, S., and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

[17]

Holte, M. B., Tran, C., Trivedi, M. M., and Moeslund, T. B. 2012. Human pose estimation and activity recognition from multi-view videos: Comparative explorations of recent developments. IEEE Journal of Selected Topics in Signal Processing 6, 5, 538--552.

[18]

Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., and Schiele, B. 2016. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV.

[19]

Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., and Bregler, C. 2014. Learning human pose estimation features with convolutional networks. In ICLR.

[20]

Jain, A., Tompson, J., LeCun, Y., and Bregler, C. 2015. MoDeep: A deep learning framework using motion features for human pose estimation. In ACCV.

[21]

Jiang, H., and Grauman, K. 2016. Seeing invisible poses: Estimating 3D body pose from egocentric video. arXiv:1603.07763.

[22]

Johnson, S., and Everingham, M. 2011. Learning effective human pose estimation from inaccurate annotation. In CVPR.

Digital Library

[23]

Jones, A., Fyffe, G., Yu, X., Ma, W.-C., Busch, J., Ichikari, R., Bolas, M., and Debevec, P. 2011. Head-mounted photometric stereo for performance capture. In CVMP.

Digital Library

[24]

Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., and Sheikh, Y. 2015. Panoptic studio: A massively multiview system for social motion capture. In ICCV.

Digital Library

[25]

Kim, D., Hilliges, O., Izadi, S., Butler, A. D., Chen, J., Oikonomidis, I., and Olivier, P. 2012. Digits: Freehand 3D interactions anywhere using a wrist-worn gloveless sensor. In UIST.

Digital Library

[26]

Kitani, K. M., Okabe, T., Sato, Y., and Sugimoto, A. 2011. Fast unsupervised ego-action learning for first-person sports videos. In CVPR.

Digital Library

[27]

Loper, M., Mahmood, N., and Black, M. J. 2014. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics 33, 6, 220:1--13.

Digital Library

[28]

Ma, M., Fan, H., and Kitani, K. M. 2016. Going deeper into first-person activity recognition. In CVPR.

[29]

Meka, A., Zollhöfer, M., Richardt, C., and Theobalt, C. 2016. Live intrinsic video. ACM Transactions on Graphics 35, 4, 109:1--14.

Digital Library

[30]

Menache, A. 2010. Understanding Motion Capture for Computer Animation, 2nd ed. Morgan Kaufmann.

Digital Library

[31]

Moeslund, T. B., Hilton, A., Krüger, V., and Sigal, L., Eds. 2011. Visual Analysis of Humans: Looking at People. Springer.

Digital Library

[32]

Moulon, P., Monasse, P., and Marlet, R. 2013. Global fusion of relative motions for robust, accurate and scalable structure from motion. In ICCV.

Digital Library

[33]

Murray, R. M., Sastry, S. S., and Zexiang, L. 1994. A Mathematical Introduction to Robotic Manipulation. CRC Press.

Digital Library

[34]

Newell, A., Yang, K., and Deng, J. 2016. Stacked hourglass networks for human pose estimation. arXiv:1603.06937.

[35]

Ohnishi, K., Kanehira, A., Kanezaki, A., and Harada, T. 2016. Recognizing activities of daily living with a wrist-mounted camera. In CVPR.

[36]

Park, S. I., and Hodgins, J. K. 2008. Data-driven modeling of skin and muscle deformation. ACM Transactions on Graphics 27, 3, 96:1--6.

Digital Library

[37]

Park, H. S., Jain, E., and Sheikh, Y. 2012. 3D social saliency from head-mounted cameras. In NIPS.

Digital Library

[38]

Pfister, T., Charles, J., and Zisserman, A. 2015. Flowing ConvNets for human pose estimation in videos. In ICCV.

Digital Library

[39]

Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., and Schiele, B. 2016. Deep-Cut: Joint subset partition and labeling for multi person pose estimation. In CVPR.

[40]

Pons-Moll, G., Baak, A., Helten, T., Müller, M., Seidel, H.-P., and Rosenhahn, B. 2010. Multisensor-fusion for 3D full-body human motion capture. In CVPR.

[41]

Pons-Moll, G., Baak, A., Gall, J., Leal-Taixé, L., Müller, M., Seidel, H.-P., and Rosenhahn, B. 2011. Outdoor human motion capture using inverse kinematics and von Mises-Fisher sampling. In ICCV.

Digital Library

[42]

Pons-Moll, G., Fleet, D. J., and Rosenhahn, B. 2014. Posebits for monocular human pose estimation. In CVPR.

Digital Library

[43]

Rhinehart, N., and Kitani, K. M. 2016. Learning action maps of large environments via first-person vision. In CVPR.

[44]

Rhodin, H., Robertini, N., Richardt, C., Seidel, H.-P., and Theobalt, C. 2015. A versatile scene model with differentiable visibility applied to generative pose estimation. In ICCV.

Digital Library

[45]

Rhodin, H., Robertini, N., Casas, D., Richardt, C., Seidel, H.-P., and Theobalt, C. 2016. General automatic human shape and motion capture using volumetric contour cues. In ECCV.

[46]

Rogez, G., Khademi, M., Supancic, III, J. S., Montiel, J. M. M., and Ramanan, D. 2014. 3D hand pose detection in egocentric RGB-D images. In ECCV Workshops.

[47]

Sapp, B., and Taskar, B. 2013. MODEC: Multimodal decomposable models for human pose estimation. In CVPR.

Digital Library

[48]

Scaramuzza, D., Martinelli, A., and Siegwart, R. 2006. A toolbox for easily calibrating omnidirectional cameras. In IROS.

[49]

Shiratori, T., Park, H. S., Sigal, L., Sheikh, Y., and Hodgins, J. K. 2011. Motion capture from body-mounted cameras. ACM Transactions on Graphics 30, 4, 31:1--10.

Digital Library

[50]

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. 2011. Real-time human pose recognition in parts from single depth images. In CVPR.

Digital Library

[51]

Sigal, L., Bălan, A. O., and Black, M. J. 2010. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87, 4--27.

Digital Library

[52]

Sigal, L., Isard, M., Haussecker, H., and Black, M. J. 2012. Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation. International Journal of Computer Vision 98, 1, 15--48.

Digital Library

[53]

Sridhar, S., Mueller, F., Oulasvirta, A., and Theobalt, C. 2015. Fast and robust hand tracking using detection-guided optimization. In CVPR.

[54]

Stoll, C., Hasler, N., Gall, J., Seidel, H.-P., and Theobalt, C. 2011. Fast articulated motion tracking using a sums of Gaussians body model. In ICCV.

Digital Library

[55]

Su, Y.-C., and Grauman, K. 2016. Detecting engagement in egocentric video. In ECCV.

[56]

Sugano, Y., and Bulling, A. 2015. Self-calibrating head-mounted eye trackers using egocentric visual saliency. In UIST.

Digital Library

[57]

Tautges, J., Zinke, A., Krüger, B., Baumann, J., Weber, A., Helten, T., Müller, M., Seidel, H.-P., and Eberhardt, B. 2011. Motion reconstruction using sparse accelerometer data. ACM Transactions on Graphics 30, 3, 18:1--12.

Digital Library

[58]

Tekin, B., Rozantsev, A., Lepetit, V., and Fua, P. 2016. Direct prediction of 3D body poses from motion compensated sequences. In CVPR.

[59]

Theobalt, C., de Aguiar, E., Stoll, C., Seidel, H.-P., and Thrun, S. 2010. Performance capture from multi-view video. In Image and Geometry Processing for 3-D Cinematography, R. Ronfard and G. Taubin, Eds. Springer, 127--149.

[60]

Tompson, J. J., Jain, A., LeCun, Y., and Bregler, C. 2014. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS.

Digital Library

[61]

Toshev, A., and Szegedy, C. 2014. DeepPose: Human pose estimation via deep neural networks. In CVPR.

Digital Library

[62]

Urtasun, R., Fleet, D. J., and Fua, P. 2006. Temporal motion models for monocular and multiview 3D human body tracking. Computer Vision and Image Understanding 104, 2, 157--177.

Digital Library

[63]

Vlasic, D., Adelsberger, R., Vannucci, G., Barnwell, J., Gross, M., Matusik, W., and Popović, J. 2007. Practical motion capture in everyday surroundings. ACM Transactions on Graphics 26, 3, 35.

Digital Library

[64]

Wang, R. Y., and Popović, J. 2009. Real-time hand-tracking with a color glove. ACM Transactions on Graphics 28, 3, 63.

Digital Library

[65]

Wang, J., Cheng, Y., and Feris, R. S. 2016. Walk and learn: Facial attribute representation learning from egocentric video and contextual data. In CVPR.

[66]

Wei, X., Zhang, P., and Chai, J. 2012. Accurate realtime full-body motion capture using a single depth camera. ACM Transactions on Graphics 31, 6, 188:1--12.

Digital Library

[67]

Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y. 2016. Convolutional pose machines. In CVPR.

[68]

Yang, Y., and Ramanan, D. 2013. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12, 2878--2890.

Digital Library

[69]

Yasin, H., Iqbal, U., Krüger, B., Weber, A., and Gall, J. 2016. A dual-source approach for 3D pose estimation from a single image. In CVPR.

[70]

Yin, K., and Pai, D. K. 2003. Footsee: an interactive animation system. In SCA.

Digital Library

[71]

Yonemoto, H., Murasaki, K., Osawa, T., Sudo, K., Shimamura, J., and Taniguchi, Y. 2015. Egocentric articulated pose tracking for action recognition. In International Conference on Machine Vision Applications (MVA).

[72]

Zhang, P., Siu, K., Zhang, J., Liu, C. K., and Chai, J. 2014. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. ACM Transactions on Graphics 33, 6, 221:1--14.

Digital Library

Cited By

Zhang YYou SKaraoglu SGevers T(2025)3D human pose estimation and action recognition using fisheye cameras: A survey and benchmarkPattern Recognition10.1016/j.patcog.2024.111334162(111334)Online publication date: Jun-2025
https://doi.org/10.1016/j.patcog.2024.111334
Liu ZXu JSuen CChen MZou ZShi Y(2025)Egocentric camera-based method for detecting static hazardous objects on construction sitesAutomation in Construction10.1016/j.autcon.2025.106048172(106048)Online publication date: Apr-2025
https://doi.org/10.1016/j.autcon.2025.106048
Shin HKim S(2024)Depth Segmentation Approach for Egocentric 3D Human Pose Estimation with a Fisheye CameraApplied Sciences10.3390/app14241193714:24(11937)Online publication date: 20-Dec-2024
https://doi.org/10.3390/app142411937
Show More Cited By

Index Terms

EgoCap: egocentric marker-less motion capture with two fisheye cameras
1. Computing methodologies
  1. Computer graphics
    1. Animation
      1. Motion capture

Recommendations

Robust Object Tracking Using Motion Context in Crowded Scenes
Advances in Multimedia Information Processing – PCM 2013
Abstract
Tracking objects in a crowded scene with occlusions has been a challenge in computer vision and multimedia in the past years. This paper presents a novel framework to track any arbitrary object through modeling its coupled motion context. For a ...
CLIPS – a camera and laser-based indoor positioning system
Indoor Positioning and Navigation. Part III: Navigation Systems

This article presents a detailed description of an optical indoor positioning system named CLIPS, short for camera and laser-based indoor positioning system. The main objective of CLIPS is pose estimation of a mobile camera with respect to a static ...
Tracking Pedestrians Using Local Spatio-Temporal Motion Patterns in Extremely Crowded Scenes

Tracking pedestrians is a vital component of many computer vision applications, including surveillance, scene understanding, and behavior analysis. Videos of crowded scenes present significant challenges to tracking due to the large number of ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Graphics

ACM Transactions on Graphics Volume 35, Issue 6

November 2016

1045 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/2980179

Issue’s Table of Contents

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 December 2016

Published in TOG Volume 35, Issue 6

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

ERC

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

118
Total Citations
View Citations
1,086
Total Downloads

Downloads (Last 12 months)126
Downloads (Last 6 weeks)12

Reflects downloads up to 16 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YYou SKaraoglu SGevers T(2025)3D human pose estimation and action recognition using fisheye cameras: A survey and benchmarkPattern Recognition10.1016/j.patcog.2024.111334162(111334)Online publication date: Jun-2025
https://doi.org/10.1016/j.patcog.2024.111334
Liu ZXu JSuen CChen MZou ZShi Y(2025)Egocentric camera-based method for detecting static hazardous objects on construction sitesAutomation in Construction10.1016/j.autcon.2025.106048172(106048)Online publication date: Apr-2025
https://doi.org/10.1016/j.autcon.2025.106048
Shin HKim S(2024)Depth Segmentation Approach for Egocentric 3D Human Pose Estimation with a Fisheye CameraApplied Sciences10.3390/app14241193714:24(11937)Online publication date: 20-Dec-2024
https://doi.org/10.3390/app142411937
Carmona DYu H(2024)BiCap: A novel bi-modal dataset of daily living dual-arm manipulation actionsThe International Journal of Robotics Research10.1177/02783649241290836Online publication date: 12-Nov-2024
https://doi.org/10.1177/02783649241290836
Yin HLiu BKaufmann MHe JChristen SSong JHui P(2024)EgoHDM: A Real-time Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping SystemACM Transactions on Graphics10.1145/368790743:6(1-12)Online publication date: 19-Dec-2024
https://dl.acm.org/doi/10.1145/3687907
Chen JWang JZhang YPandey RBeeler THabermann MTheobalt C(2024)EgoAvatar: Egocentric View-Driven and Photorealistic Full-body AvatarsSIGGRAPH Asia 2024 Conference Papers10.1145/3680528.3687631(1-11)Online publication date: 3-Dec-2024
https://dl.acm.org/doi/10.1145/3680528.3687631
Hao SChai WZhao ZSun MHu WZhou JZhao YLi QWang YLi XWang GCai JKankanhalli MPrabhakaran BBoll SSubramanian RZheng LSingh VCesar PXie LXu D(2024)Ego3DT: Tracking Every 3D Object in Ego-centric VideosProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3680679(2945-2954)Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3680679
Tsai CYen RKim DVogel D(2024)Gait Gestures: Examining Stride and Foot Strike Variation as an Input Method While WalkingProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology10.1145/3654777.3676342(1-16)Online publication date: 13-Oct-2024
https://dl.acm.org/doi/10.1145/3654777.3676342
Baek SGil YKim Y(2024)3D Human Pose Estimation Using Egocentric Depth DataProceedings of the 30th ACM Symposium on Virtual Reality Software and Technology10.1145/3641825.3689515(1-2)Online publication date: 9-Oct-2024
https://dl.acm.org/doi/10.1145/3641825.3689515
Armani RQian CJiang JHolz C(2024)Ultra Inertial Poser: Scalable Motion Capture and Tracking from Sparse Inertial Sensors and Ultra-Wideband RangingACM SIGGRAPH 2024 Conference Papers10.1145/3641519.3657465(1-11)Online publication date: 13-Jul-2024
https://dl.acm.org/doi/10.1145/3641519.3657465
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents